ArchiveBox/ArchiveBox: open-source self-hosted web archiving (github.com/archivebox)
201 points by rcarmo on July 16, 2021 | 25 comments



I really need to try this out soon. I keep bumping into it online. What do people on HN use it for?

What I use right now "to collect, save, and view sites you want to preserve offline" is a Firefox plugin called WebScrapBook. Click-click-done, and I have a local searchable (!) copy of a webpage exactly as it looked in the browser. With styles and all, in one file. WebScrapBook is pretty highly configurable.

In the future I would like to have a solution that doesn't require some Firefox plugin.


I’ve used it to save a lot of pages related to ham radio. I have several 30-40 year old radios and I’m afraid one day information about them will just drop off the internet.


Cool! How is the saving experience? Is it quick? And looking up? Or is it more just 'saving for later use'?


Fairly quick, depending of course on the weight of the page(s) and how many archive methods you enable (I have several turned off because they are redundant for my purposes). The "adding to archive" interstitial still tends to time out which is a little annoying, but the actual archive process is backgrounded so it doesn't matter.

The recommended install includes a search engine that works well, aside from a few false positives due to being fuzzy search. I don't have much in it yet so I can't say what the performance is like once you reach e.g. thousands of pages, but I imagine it would still perform well except maybe for mass operations like rebuilding the entire index.


Consider having archive.org save the websites, in case they are not already.


I think it would be neat to try integrating ArchiveBox with the enhanced history / browsing context extension Promnesia [1].

[1]: https://github.com/karlicoss/promnesia


How does this compare with, say, Wallabag? https://github.com/wallabag

Looks like ArchiveBox has more export options? EDIT: looks like ArchiveBox is focused on continuous change tracking rather than just snapshots like Wallabag.


Actually, no. You can take more snapshots, but that's just an added feature.


Just FYI, I have this set up as a Docker container on my Synology and it is now patiently crawling through my (imported) 2000+ Pocket URLs, to which I’m adding a lot of other stuff scattered across other “clipping” tools (like OneNote).

Key benefit for me is having actual local files. The resulting PDFs are searchable on their own, so I can sync those back to my Mac for reference (and Spotlight indexing). But the HTML snapshots are also pretty decent.

One thing I’ll be looking into is automatic tagging (since it’s a Django app there are plenty of likely ways to inject that info).


I just got my first Synology literally two days ago, a DS3617xsII. Looking forward to playing with it, especially the virtualization / Docker features. How do you like it?


Very much so, this is my second and I've had it for a year now: https://taoofmac.com/space/blog/2020/04/04/2310


Cool post. Sounds like it has a lot of cool features, looking forward to messing with it over the weekend.


Wow, it can even extract video! This looks phenomenal, and a great excuse to stock up on more disk space.

Their roadmap is also very interesting: "v2.0 Federated or distributed archiving + paid hosted service offering"

https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap#v20-fe...


i have been archiving cool or interesting sites with this tool for about a year now. fantastic. can't say enough good things about it.


Thanks for posting this @rcarmo!

One neat tiny implementation detail of ArchiveBox that I just highlighted on HN today is our use of asymptotic progress bars when we don't know how long archiving a page is going to take: https://news.ycombinator.com/item?id=27860022
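
For readers unfamiliar with the idea, here is a minimal sketch of an asymptotic progress bar. It is not ArchiveBox's actual code: the function names, the `expected` parameter, and the hyperbolic curve `elapsed / (elapsed + expected)` are all assumptions chosen for illustration. The point is only that the bar keeps moving but never claims to be finished while the real duration is unknown.

```python
def asymptotic_progress(elapsed: float, expected: float) -> float:
    """Return a fraction in [0, 1) that approaches 1 as elapsed grows.

    `expected` is a hypothetical guess at the typical duration in seconds;
    the bar reads exactly 50% when elapsed == expected.
    """
    if elapsed <= 0:
        return 0.0
    return elapsed / (elapsed + expected)


def render_bar(fraction: float, width: int = 20) -> str:
    """Draw a fixed-width text bar for a fraction between 0 and 1."""
    filled = int(fraction * width)
    return "[" + "#" * filled + "." * (width - filled) + "]"


if __name__ == "__main__":
    # Show how the bar slows down as time passes without ever filling up.
    for seconds in (0, 5, 15, 30, 60, 300):
        fraction = asymptotic_progress(seconds, expected=30.0)
        print(f"{seconds:>4}s {render_bar(fraction)} {fraction:.0%}")
```

When the task actually finishes, the caller would jump the bar straight to 100%; until then the curve only gets closer to full.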


When I tried to add my bookmarks from a file, one of the websites seems to not be online anymore; archivebox just stops execution when it encounters this issue.

How can I make archivebox ignore such errors and continue with the rest of the websites?

Command used:

  archivebox add < exported_bookmarks.html --depth=1


It should not stop execution. Can you open an issue with your log output or a screenshot?

Also note that you have the arguments backwards: make sure to put the file redirect at the end, after --depth=1.

It's also possible that was the last URL in your list and it stopped at the end. If you want to re-try archiving you should run this instead:

    archivebox update


What is the advantage of this over something like Kiwix, or just using Playwright CLI? It seems useful but a bit unnecessary if you're just using it to create Archive.org links.


I use it like permanent bookmarks. I can go back to it and trust it'll still be there. And this isn't just "things disappear eventually" - for a specific example, I was working on something rather last-minute recently, and wanted to refer to a vendor's whitepaper - and their entire site was "down for maintenance" all weekend. So I went onto my archive and I still had a copy from my first pass over the topic. It's a free resource, I can't blame them and I can't complain - but if I can make sure it doesn't impact me, even better.

I know other people would still have that tab open from 3 weeks ago, but I just don't work like that.

I'm not going to complain about wayback/archive.org at all, but the nature of the beast is that there are certain requests they have to obey - and with my own offline, non-exposed equivalent, I don't (well, I do, but I simply don't receive them).


> I use it like permanent bookmarks.

I create browser bookmarks regularly but I wouldn't be bothered to SSH into my server to also tell it to grab a copy of the URL. Automating this with a browser plugin would be cool.


There is a web UI, so I have a JS "bookmarklet" in my toolbar. If you go to the web UI and Add a url, right at the bottom of the page there's a link you can bookmark to do the same. So that's my workflow - click Archive (or cmd-alt-1), then hit enter. Done.


Aha, that's great! I might give this a go this weekend then.


There's also a real browser extension in the works by one of our users: https://github.com/ArchiveBox/ArchiveBox/issues/577#issuecom...


This does _a lot more_ than just creating archive.org links. It saves the entire page contents (in multiple formats, including nicely searchable PDF and some embedded media) locally.


ArchiveBox is also an awesome tool for creating copies of a website, whether you want to demonstrate a phishing attack or do a POC to integrate your product.



