SingleFile: Save a complete web page into a single HTML file (github.com/gildas-lormeau)
958 points by crbelaus on March 2, 2022 | 240 comments



Author here, it makes me really happy to see SingleFile on the front page of HN. Thank you! I'll take this opportunity to make you aware of the upcoming impact of Manifest V3 [1], and for those who prefer zip files, I recommend having a look here [2].

[1] https://github.com/gildas-lormeau/SingleFile-Lite

[2] https://github.com/gildas-lormeau/SingleFileZ


Thanks for this project. I found SingleFile a year or two ago and used it to take "HTML screenshots" of third-party sites that I could embed in guided walkthroughs, with real data swapped out for modified/example data, instead of just PNGs.

SingleFile was ultra-valuable for this.

If anyone has a similar use-case, I wrote some pretty rough (and slow) code to post-process SingleFile's output to remove any HTML that wasn't contributing to the presentational render by launching puppeteer and comparing pixels. It's available here: https://github.com/mieko/trailcap


It's interesting! I had started something similar as part of testing but hadn't really finished my work. I will have a look at your project.


One very useful thing you could add to this (if you feel like it) would be to make it work with snapshotted directories, rather than a single HTML file with inlined data. You can get the former with SingleFileZ and then extracting the resulting zip file.

I like these because it makes it easier for me to make manual edits when necessary and it's a better solution for long term archiving (IMO). But I would love to add your project to my workflow.


Oh, that’s such a good idea for documentation - thanks for sharing!


SingleFile is one of my favorite addons since it allows me to keep offline copies of articles, tutorials, etc. I see online without losing images and the like (a ton of articles have been lost over the years, and while some are preserved in archive.org, they often lack things like images, so I prefer to save anything I come across). So thank you for making it :-).

Now, having said that, the text in SingleFile-Lite's "Notable features of SingleFile Lite" sounds like a list of issues :-P. It looks like these are issues with Chrome, but do you know if/how these "improvements" will affect Firefox?


AFAIK, for the moment Mozilla is aware of the regressions that Manifest V3 causes and is showing good will in trying to reduce them as much as possible. You can find some information about this here: https://github.com/w3c/webextensions/tree/main/_minutes


I use ArchiveBox (Django-based) for keeping offline copies (it uses SingleFile and a few other libs in the backend): https://github.com/ArchiveBox/


You might be interested in https://github.com/wabarc/wayback, a chat-based archiving tool.


Thank you for the Manifest V3 critique, the examples you give make it really clear how many things are regressing with this upcoming change :/


FYI: Figure tags don't convert their hrefs to base64.

For example, try saving my home page: https://andrewrondeau.herokuapp.com/

The img tags are converted correctly, but there's still <figure class=image><a href="https://andrewrondeau.herokuapp.com/... in the single HTML file.


I cannot reproduce your issue, I just did a test on this page and I see the expected `<img src="data:image/jpeg;base64,...` in the saved page.


Look for figure tags, not image tags

Even better, just search for ="http


All I can find is a <a> tag.

Edit: I think I now understand the issue. I confirm SingleFile saves only the current page and not linked images, for example.


I've been using SingleFile for the last year or so, it's amazing!

I'm going to hijack your post for a question! I love the way you can use the editor and select "format for better readability," then save just the stripped down version of the page. I use this to send it to my e-ink device.

The question I have is whether it's possible to toggle the default save to use the formatted version automatically? I dug into the options and didn't turn anything up!


You can enable these options for this:

- Annotation editor > default mode > edit the page

- Annotation editor > annotate the page before saving


Sorry, I was wrong, you have to select "format the page" instead of "edit the page" (first item).


Thank you for your work! I've been using start.me for my new tab page (it's the page you see every single time you open a tab, I can't believe most people don't make it useful), but it's way too slow so I SingleFile it and have a local Firefox extension to set it as my new tab page.

It's too complicated but at least it works!


Is it possible to use this within the context of the current web page, without the extension portion?

Taking a snapshot of my user's screen and then displaying it to them later (maybe in an iFrame)?


It's possible, but it's a bit limited. For example, it won't be able to save images coming from a different origin.
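A rough sketch of the in-page approach (a hypothetical helper, not SingleFile's actual code):

  // Inline same-origin images as data: URIs, then serialize the DOM.
  // fetch() of cross-origin images fails unless the server sends CORS headers,
  // which is exactly the limitation mentioned above.
  async function snapshotPage(doc = document) {
    for (const img of doc.querySelectorAll("img[src]")) {
      try {
        const blob = await (await fetch(img.src)).blob();
        img.src = await new Promise(resolve => {
          const reader = new FileReader();
          reader.onload = () => resolve(reader.result); // "data:image/...;base64,..."
          reader.readAsDataURL(blob);
        });
      } catch (e) {
        // e.g. a CORS failure: the image keeps its remote URL
      }
    }
    return "<!doctype html>" + doc.documentElement.outerHTML;
  }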


Thanks for the work you have done, it's a lazy man's heaven, especially for bulk downloads, and it helped me a lot. About a month ago I decided to back up my bookmarks via ArchiveBox; it was more than 1k bookmarks, and the most reliable methods were SingleFile and wget.


Well done, gildas! You have really built a quality product and marketed your work well.


I love it and use it. But product? Nothing is being sold and product sounds almost condescending. :)


> But product? Nothing is being sold and product sounds almost condescending. :)

Really!? You don't think it's condescending to try to find criticism in someone's praise, haha? Heh, anyway :) You feel product is bad!? So weird!! I guess you find there what you bring to it. Wonder what you're protecting there; if you share more of your thinking, we can get to know you more. Even so, I think we can just celebrate gildas' achievement! :)


Thank you Cris!


A twelve-year project with nearly 7000 commits shows a lot of dedication. Good work.


If I start using SingleFile today, will I still be able to open saved pages after the update to Manifest V3?

I mean, if I want to save pages over the next 11 months, should I install SingleFile or SingleFile Lite?


In fact, you simply do not need an extension to open pages saved with SingleFile (or SingleFile Lite) because they are standard HTML pages. So you don't have to worry about that.


This alone is fantastic. I've been looking for an mhtml replacement that worked well across all browsers.


Is there a configuration for the zip version where I can avoid duplicating the static assets? Thanks


I guess you're referring to SingleFileZ. This option is not needed because zip files (i.e. what SingleFileZ produces) already provide this feature.


Very nice! Will use it for sure. May I ask you how you created that good looking demo gif?


I used:

- ScreenToGif to record video sequences and produce the final GIF: https://www.screentogif.com/

- Macro Recorder to record and replay user navigation: https://www.macrorecorder.com/

- Blender to edit the video, add text comments, and make the intro: https://www.blender.org/


Thank you for sharing!


Thank you, very useful and works like a charm: a must have.


FYI - there’s an official standard (MHTML) for doing this that has existed for 20+ years and exists natively in browsers.

https://en.m.wikipedia.org/wiki/MHTML


> FYI

The alternative format (used by the Internet Archive and Wayback Machine) is WARC. It's also a single file, but it preserves the HTTP headers as well, so its application is specifically archival. [1] The "wget" tool, which is co-maintained by the Web Archive people, also has support for it via CLI flags. [2]

Though when it comes to mobile browser support, I'd recommend using MHTML, because WebKit and Chromium both have support for it upstream.

[1] http://iipc.github.io/warc-specifications/

[2] https://www.gnu.org/software/wget/wget.html


WARC is also used by the Webrecorder project. They made an app called Wabac which does entirely client-side WARC or HAR replays using service workers and it seems to have pretty good browser support, but I haven't really dug into the specifics.

https://github.com/webrecorder/wabac.js-1.0


There is a project that uses a headless browser to implement HAR.

https://github.com/wabarc/screenshot


Is there any objection to adding WARC support to webkit/chromium? Seems like a not-so-complex project...


I know that WebKit relies on either libsoup [1] (on Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?)) as a network adapter, so the header handling and parsing mechanisms would have to be implemented in there.

Though, on MacOS, WebKit tries to migrate most APIs to the Core Foundation Framework, which makes it kind of impossible to implement as a non-Apple-employee because it's basically a dump-it-and-never-care Open Source approach. [3]

Don't know about chromium (my knowledge is ~2012ish about their architecture, and pre-Blink).

[1] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...

[2] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...

[3] https://github.com/opensource-apple/CF


GTK/WPE use libsoup. PlayStation/Windows use curl. And yes, Apple's networking is proprietary.


I wasn't sure about WPE with regard to libsoup, due to the glib dependencies and all the InjectedBundle hacks that I thought they wanted to avoid.

I mean, in principle curl would run on the other platforms too... but as far as I can tell there's an initiative to move as much as possible to the CF framework (strings, memory allocation, HTTPS and TLS, sockets, etc.) and away from the cross-platform implementations.


Over a decade ago I had a laptop but no internet at home. This was one of the ways I taught myself programming (and also downloaded dozens of manga): using Internet Explorer at a cafe, which had an option to save to MHTML, a single file with everything self-contained. I legitimately owe a portion of my success to this. I still have some of these files, old crusty hello-world C++ tutorials, etc.


I have fantastic internet, and I still do something similar. Local docs just load so much faster, and if something happens (which it still does, even on Fiber in the US), I have docs and can program.

Lemme see if I can pull up the command I use to mirror doc sites.

    wget \
      --recursive \
      --level=5 \
      --convert-links \
      --page-requisites \
      --wait=1 \
      --random-wait \
      --timestamping \
      --no-parent \
      $1
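If you save that as a script, say mirror-docs.sh (a name I just made up), usage is just:

  ./mirror-docs.sh https://docs.djangoproject.com/en/4.0/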


For people who cannot afford internet access now, and for perhaps more in the future if times get more difficult, I believe this is a very important use-case.


The Chrome engineer who maintains the MHTML work wrote up a comprehensive doc on the modifications to the MHTML spec (RFC 2557) that are implemented: https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK... Might be useful for you, gildas.


Thank you Paul! I had read this document some time ago, especially to see how the shadow DOM was serialized.


The browser compatibility section suggests MHTML is unsupported in current versions of Firefox and Safari.


I don't think it was ever native in Firefox, there is/was the excellent unMHT extension that was broken by Quantum/WebExtensions and The Great XUL Silliness. Shame.

I have Waterfox-Classic and unMHT (fished out of the Classic Addons Archive, just remember to turn off Waterfox's multiprocess feature) since I occasionally need to archive web pages - and more importantly, reopen them later.

mhtml is just MIME, literally every discrete URL as a MIME part with its origin in a Content-Location header, all wrapped in a multipart container. I don't understand why it's not a default format.
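For illustration, a hand-written skeleton of what an .mht file contains (simplified; real browser output differs in the details):

  MIME-Version: 1.0
  Content-Type: multipart/related; type="text/html"; boundary="PART"

  --PART
  Content-Type: text/html; charset="utf-8"
  Content-Location: https://example.com/

  <html>...the page markup...</html>

  --PART
  Content-Type: image/png
  Content-Transfer-Encoding: base64
  Content-Location: https://example.com/logo.png

  iVBORw0KGgo... (base64 image bytes)

  --PART--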


I can see WebExtensions breaking it (as it's a completely new set of APIs for extensions, and the losses do definitely still hurt)... but quantum/xul? How is that related, aside from "it happened around the same time"?


IANA Firefox dev: XUL/XPCOM = old APIs, WebExtensions = new (multi-browser) API.

Quantum was the project name for re-engineering the Firefox internals, with lots of design changes, not just plugins. The XUL/XPCOM APIs were dropped; as an occasional programmer I understand why, and "Quantum broke my plugins" is a reasonable first approximation for most users.


Safari supports webarchive, which does basically the same thing


The problem is that it is a proprietary format. The advantage of the format produced by SingleFile (HTML) is that as long as your browser is capable of interpreting HTML, you will be able to read your archives without worries.


Not so proprietary. It's really just a plist file, whose format is known and even open-sourced by Apple [1]. Really, it's only proprietary in that no other platforms have implemented it.

[1]: https://opensource.apple.com/source/CF/CF-550/CFBinaryPList....


For anyone else that didn't read the README, MHTML is mentioned in the comparison section https://github.com/gildas-lormeau/SingleFile#file-format-com...


Take the comparison with a grain of salt. Not including WARC is like excluding water from a comparison of beverages, it is the baseline standard.


> MHTML, (...) is a web page archive format used to combine, in a single computer file, the HTML code and its companion resources (such as images, Flash animations, Java applets, (...)

Well that goes to show its longevity I guess.


Does anyone else get two security warnings whenever you try to save an MHTML page using a Chrome extension? I have to click one warning's button to confirm that I indeed want to save the "dangerous" file and another to confirm I'm really sure. It's gotten very annoying. I've looked all over for an option to disable this behavior but haven't been able to find one.


I've looked into this extensively, as I can't find a good, light, and easy backup option that isn't extreme overkill.

I thought MHTML was NOT standardized which is why it wasn’t across all browsers yet. From what I remember, every company was doing their own implementation of it. Maybe it’s gotten more standardized the last few years though.


I've always thought the "M" stood for "Microsoft" -- wasn't even aware any browsers other than IE supported it.


There is also CHM which is actually a Microsoft only file format for "Compiled HTML Help" files.


I love this format. Very fast and compact. The entire Visual Studio help was in it once. Worked VERY well. And there's a KDE/Qt reader.


And it generally does not do a good job


What are the issues?


The big one in my experience is that it doesn't play well at all with JavaScript. SingleFile, to my knowledge (I experimented with it briefly), allows all JS to load on the page and can then embed the loaded media as base64. I think it also has heuristics to embed relevant JS as well. It still only gets you 90% of the way there, and I came to the conclusion that unless you are doing web-archive-type work or need audio/video, a composite image works well.


From my experience: wrong layout, missing pictures.


Unfortunately mhtml is not widely supported.


I remember saving webpages in MHTML when I was using dial-up so that I could read them offline later.

I would also download entire websites using a piece of software whose name I forgot, to read them offline. Back when websites fit on a single floppy disk.

Good times!


I remember using HTTrack for this a while back. Still have a few of those sites lying around, I think.


I was gonna say Opera (the old, good one) had this. When saving a page there were some options and one was a single file IIRC.


I use this Chrome extension to save web pages as MHTML: https://chrome.google.com/webstore/detail/save-webpages-offl...


IIRC, back in the day MHTML wouldn't save Java applets.


Are any sites still using applets these days?


80% of server IPMI web control panels. But who would want to save those anyway? :)


A lot of those are getting HTML5/Canvas based implementations and most of the old AST BMCs can get it through upgraded firmware.


None of my machines had any such upgrades and never will :(


Can we please stop with the 17MB GIF images used as demos? They use up lots of data immediately as you open the page, and they're impractical: you don't know how long the animation is, you can't fast-forward/rewind, and you can't go fullscreen on mobile.

And GitHub supports embedded videos in README.md files; videos are generally smaller than GIF files, and their disabled autoplay is a feature: you save your data until you press play.


> GitHub supports embedded videos in README.md files

True since May 2021 so I think a lot of people are still finding this out...

In my experience GIF is still the most set-it-and-forget-it way to know a video will play; to get cross-platform support out of mp4 you may have to provide two different codecs. Anyway, not disagreeing with you, and most GIFs could drop 90% of their size with a better choice of resolution and framerate. This readme is particularly egregious, doing a screen capture with scrolling.

As for saving bandwidth until you want to play, I haven't tried this yet but it seems adequately clever to wrap a loading=lazy gif inside a details/summary tag: https://css-tricks.com/pause-gif-details-summary/
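i.e. something along these lines (adapted from the article; the filename is a placeholder):

  <details>
    <summary>Play the demo (17 MB GIF)</summary>
    <img src="demo.gif" loading="lazy" alt="SingleFile demo">
  </details>

Since the image starts out unrendered inside a closed <details>, the lazy loader shouldn't fetch it until the user expands it.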


> to get cross-platform support out of mp4 you may have to provide two different codecs

Video codecs are not my area of expertise. Which codecs are these and what tool(s) would you typically use to ensure you provide them?


See this gist and the code comments [0] - basically you just need to know the magic flags to pass to ffmpeg, transcoding the file with all the right settings.

[0] https://gist.github.com/ingramchen/e2af352bf8b40bb88890fba4f...
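If it helps, the two encodes usually look something like this (the flag choices are my own defaults, not from the gist):

  # H.264/MP4: yuv420p and faststart are the compatibility-critical flags
  ffmpeg -i demo.mov -c:v libx264 -crf 23 -pix_fmt yuv420p \
    -movflags +faststart -an demo-h264.mp4

  # VP9/WebM as the second codec, for browsers that prefer it
  ffmpeg -i demo.mov -c:v libvpx-vp9 -crf 35 -b:v 0 -an demo-vp9.webm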


Not to mention that H264 can take quite a bit of horsepower to decode and play as well (assuming your machine doesn't have a hardware chip specifically for doing just that).


Which machine doesn't? Anything in the last 10 or so years will decode H264 with much less power than GIF because of it. Even a Pi supports it.


My 2014 ThinkPad X1 Carbon (gen 3) doesn't have hardware transcoding as far as I can tell, which made Zoom and Discord impossible to use for class, especially because there was no way (that I knew of) to disable all video except the presenter's. Even playing a YouTube video on it makes it ramp up.


I'm not sure which CPU you have specifically but the lowest-end model of the X1 Carbon Gen3 has an i5-5200U [1] that lists Intel Quick Sync Video support.

From the wiki page for Quick Sync [2]:

> Intel Quick Sync Video is Intel's brand for its dedicated video encoding and decoding hardware core. Quick Sync was introduced with the Sandy Bridge CPU microarchitecture on 9 January 2011 and has been found on the die of Intel CPUs ever since.

I can't confirm but I'd guess your performance issues lie elsewhere than in the h264 decoding specifically.

[1] - https://ark.intel.com/content/www/us/en/ark/products/85212/i...

[2] - https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video


If you check out the generation-codec table in that Wikipedia article [1], under Broadwell (I believe that's the 5200U's generation name), it says there is support for AVC (which I believe is H264; I'm not a codec wiz), so that's a really good point. I'm not sure why I've consistently had issues with this on my machine, then. I wonder if it's something with the configuration on Linux?

Thanks for pointing that out. I've looked at this table before and paid attention to HEVC, not AVC, so I believe that's where my mistake came from.

[1] https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video#Hardwar...


AVC is H.264, yes.

Accelerated video decode is often disabled by default on Linux versions of browsers and can be quite dependent on versions of drivers/mesa/X-vs-Wayland/etc.


YouTube by default prefers newer, bitrate-saving codecs over old ones if it thinks your CPU can handle software-decoding them. On my 2017 Dell XPS, 1080p and lower resolutions on YouTube play in software-decoded AV1, and only 1440p and higher play in hardware-decoded VP9, so playing 4K video on YouTube is less taxing for my CPU than playing a 1080p video...


You can use the h264ify extension to fix it.


The problem is Zoom and Discord are doing multiple streams. But it really shouldn't be a problem.

H.264, even on the High profile, is not CPU-intensive on a 2014 machine, unless you are watching 1080p at 5-10 Mbps, which is not the norm for internet video.


Is this really still an issue in 2022? How many people are browsing the internet on a device that can't do hardware H264 decoding?


Some browsers have poor hardware decoding support on Linux (their problem, not the drivers'), but it's gotten a lot better recently.


Author here, sorry for the GIF file. I created it because people were not happy with the video hosted on YouTube. AFAIK, embedded video files did not work on GitHub when I made this demo. I'll try to improve this in the future.


I wanted to comment on how useful that demo was to me. It did a great job at demonstrating why this is useful and how well it works compared to the native browser implementation. Thank you both for the demo and for the project!


Thanks :)


Note that you can trim about 1/3 of that video by using an online GIF compressor such as https://www.freeconvert.com/gif-compressor (no affiliation).


Would you mind sharing which tools you used to create that demo? It is really well done.



GitHub only recently expanded video support from gif to decent video formats, and many github enterprise installs don't have those new features yet. So, keep spreading the word.


I think there is some nuance here.

If the demo sequence is <5 seconds, I have never found myself becoming impatient. Gif is perfect for very brief demos. Anything longer than that and I'd like to have some idea where I am at in the video stream (and other controls as indicated)


Just the "intro" splash takes more than 5 seconds, and it's totally unnecessary

Yes, the gif bothered me too :D


It would seem that this is very subjective. However, I admit that I let myself go a bit on the intro, and I can understand your point of view.


I wish browsers came standard, preconfigured with warning dialogs that triggered if assets attempting to load were beyond some threshold. That threshold could be decided by the browser vendors group based on some collection of network statistics and be adjusted on an annual basis or so.


> And GitHub supports embedded videos in README.md files

Any documentation on this? Because I have tried to embed video in issues and PRs before, and did not manage. I'm hoping such documentation will explain how this extends to issues and PRs.


In issues it's just drag-n-drop.


Giving a massive upvote for this, disappointed and confused to see you've been downvoted here. There's literally no reason to use GIFs like this, and - as you stated, it's massively disrespectful to those not fortunate enough to have broadband connections, but would like access to the information.

Using data so wastefully like this always reeks of privilege to me - especially on something like GitHub. Wikipedia, for instance, never allows things like this.


> disappointed and confused to see you've been downvoted here

Because it's a relatively new feature, and probably, a lot of devs don't know about it (I didn't).

I did this [animated gif] once actually, before the feature was introduced, and I definitely hated it, but I had no choice.

Thanks for bringing this to the general attention, though :)


Sharing a project with the world and taking time to document it reeks of privilege? I really can't understand your reasoning.



The issue is mainly with mobile browsers, as mobile data is expensive... and Firefox on iPhone doesn't have about:config.


FYI, I updated the README page. Now it includes an mp4 video weighing ~9MB.


I also want to give praise about the demo. It's one of the best demos I've ever seen with such a project. Nice job!


A 16MB gif with no playback controls, so you had to go through the tedium.


I would be surprised that the author isn't using WebM to get a smaller file size (not to mention higher quality), but the project itself leads me to believe that the author has a lot of free disk space to use.


There's no need to make further assumptions about the author (who, btw, took the time to build a very useful tool and share it on the Internet for free). Just point out the issue with the GIF and move along.


I think you missed something here; parent wasn't insulting the author. They were inferring that someone who made an archiving tool probably has resources that help with archiving (bandwidth means access to documents and storage space means a place to save them-- you get the idea).


I never made an assumption about the author and certainly never said that the tool wasn't useful. You can feel free to move along yourself, though.


I love, love this extension. I am working on an app to turn it into a single-click bookmark system on Linux: run an inotify service to watch your downloads, then process any SingleFile downloads into a database and update a browsable index.


You might like https://archivebox.io/, I think it can do this for you and then some.


TELL ME MORE.

I think I basically get the idea, what kind of database are you using? Recoll sounds like a good idea, but I'm also thinking about how I might also make this public-ish.

(i.e. I teach in college and would love to have a centralized way to store and search all my assigned readings, which are most often webpages)


I am not a trained software engineer but...

Each HTML page is processed by (1) getting the URL, title, and time saved (this is under-rated, as the approximate time of saving is useful if you want to rediscover things), then (2) taking a screenshot, and finally (3) extracting text with readability.js and hopefully doing some keyword analysis.

Right now it is stored in a local SQLite Database, although the article content is stored in text files. For search, I can use ripgrep to look through the associated text files.

The eventual goal is to create a Flask app which will allow for interactive management of the bookmarks (tagging, searching). I've already got static generation of bookmarks working.

Here's a screenshot: https://imgur.com/5YP4sP5
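For the curious, the watcher part is tiny; a rough sketch with inotify-tools (the processing script is hypothetical):

  # Watch Downloads for finished writes and hand saved pages to the pipeline
  inotifywait -m -e close_write --format '%w%f' ~/Downloads |
  while read -r file; do
    case "$file" in
      *.html) ./process-bookmark.sh "$file" ;;
    esac
  done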


I archived (privately) some documentation pages from some of our vendors that were behind a login page, using this, just in case they became inaccessible at a critical time for us.


Awesome! Let's bring it up a notch - save every page automatically.

Then add a search engine over that, for 'what was that article about long term effects of DDT on ecosystems I was reading a long time ago?' queries.

And you get a memex - a way to outsource part of the brain to a computer :).


I'm using Recoll for this exact purpose. Just without inotify.


This sounds neat.


WANT


Maybe a little OT, but founders should take a careful look at this landing page. That's how you sell something. The demo is clear about the problem they're trying to solve and it convinced me that their product actually solves it. It's not just all the information they've included, but also the lack of irrelevant clutter.


I scrolled past the gif because I didn't realize that it was an informative gif. The first few seconds looked like just an animated logo, and I never stayed to watch it. It could've just started with an action instead of animating the logo.


you actually sat there and watched a 17MB gif video at 1x speed with no controls? lol. worst landing page i've ever seen


I think this is the worst comment I've ever seen as well! ;)

BTW, it's not a landing page. This is the README on GitHub.


Related: I used to keep a collection of locally mirrored web pages a long time ago, with a legendary Firefox extension called ScrapBook [0] (now long retired). The surprise for me is that after all these years I still remembered the name...

While writing this comment I found that it lived on as a (now "legacy") new extension named ScrapBook X [1], and then yet another one named WebScrapBook [2], which seems to still be alive!

[0]: http://www.xuldev.org/scrapbook/

[1]: https://github.com/danny0838/firefox-scrapbook

[2]: https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/


It should be noted that Manifest V3 will break this extension for Chromium-based browsers.

https://github.com/gildas-lormeau/SingleFile-Lite


Love the list of notable "features". :)


Also this:

> Benefits of the Manifest V3

> - None


What a cool project! I love the way this embeds images. One of things I miss most, though, when going back to old sites, is embedded audio or video. From looking at the options, it seems like it might be able to handle encoding video and/or audio as Data URIs, but it's not totally clear if SingleFile does this or not. I wasn't sure if I was doing the correct things to force this behavior in the options. It would be great if the README could clarify how these are handled by SingleFile. Sometimes it might be nice to be able to embed these sorts of things, even if it does make the HTML ridiculous and bloated. Or, barring that, maybe just a recommendation to use one of the other formats in the comparison table for this kind of use case.


Unfortunately that won't allow you to click links in your offline version. You can do this properly with wget (sorry, I don't know how to do code formatting on Hacker News):

  wget --mirror \
    --convert-links \
    --html-extension \
    --wait=2 \
    -o log \
    https://example.com


Are you suggesting mirroring e.g. the entire Wikipedia through wget?

That is not only suboptimal, it is stressful for the server. At least you added --wait=2, but on any large site/hoster/CDN this might still get your IP banned or throttled. And on e.g. the English Wikipedia this will take 149 days, which means that by the time you hit the last page, the first ones (and their links) are out of date.


If you add '--no-parent' (doesn't request anything above the requested URI that isn't a page dependency) and '--level=5' (only follows links 5 levels deep), you won't get all of a site. That makes it more realistic for getting Wikipedia articles.


You don't need to newline every flag of a trivial command.


I'm guessing the user's intent was to have the command formatted across multiple lines.


Looks like SingleFile helps with sites where you have to be logged in, something that is not that easy with wget.


What are you talking about? I have hundreds of pages saved with SingleFile and I can click links in all of them.


Oh maybe it does work then. I assumed it didn’t follow links because they didn’t show it in the video.


Do you mean converting links so that they point to your local copy of other pages from the same site? Yeah it doesn't do that; it isn't a tool meant for archiving whole websites. It's just for archiving a single page, with all the content baked into it. Like a bookmark but doesn't linkrot.


Code formatting is just indented text.

  So one empty line followed by indented text (2 or more spaces)


Been eyeing this for a long time!

I'm building a bookmark app, and I plan to use this to save bookmarks!

I'm a simple man, nothing too fancy. Here's a crude demo in progress - https://zewallet.netlify.app/ Follow progress here - https://twitter.com/recursiveSwings/status/14917723874649088...

Would love to have ANY tips or feedback!


the signup email confirmation link points to http://localhost:3000/ btw

I'm definitely in the market for a bookmark service that archives my bookmarks; Diigo stopped working a year or two ago, and Pinboard can't stay up.


Zotero deals with this reasonably well, and happens to be using SingleFile under the hood. Its landing page just targets a specific audience (academics), which means upwards of 90% of the people who would happily use it probably end up bouncing after thinking, "This isn't for me", before ever trying it. Give it a shot.


Ahh damn, should fix it! For now, you can edit the URL manually to take a peek. If you're interested, feel free to send a DM on Twitter @recursiveSwings, I'll let you know once it's in Beta! :)


Fixed now!


This is great. I've always wondered why this isn't the default behaviour for page saving in browsers. To an ordinary user saving a page implies saving a single file, not a file plus a directory of stuff. HAR can be useful but seems only for niche or specialised reasons.


I use a HAR file extractor because normally I don't want a single file; I want a replica of the web server's file system structure, including any dynamically loaded assets: https://blog.cetinich.net/content/2022/download-website-and-...


From the warning before installing in Firefox:

    Add SingleFile? This extension will have permission to:
      - Access your data for all websites
      - Input data to the clipboard
      - Extend developer tools to access your data in open tabs
      - Download files and read and modify the browser's download history
      - Access browser tabs
... all of which I don't mind, as long as the extension can't exfiltrate any of the data it can access (send it to a third party), i.e.:

    - no network connections from the extension
    - no modifying web pages or executing any code in the context of a web page
Does anyone know whether these sorts of extensions can exfiltrate data? This is a concern if the project author's credentials are stolen by a threat actor, which has happened before.


Security question: Is a web extension safe if it is installed but if you're not using it at the moment? For example, if I were logged into my bank's website and I did not click the SingleFile button in the extension toolbar, could it still theoretically collect info from my bank's webpage or do other actions?

I'd like to use SingleFile and have no reason at all to distrust it, but I'd like to understand the security impact of installing lots of web extensions. How do people handle security risks like that? Do you run a separate vanilla browser with no extensions for sensitive tasks?


For technical reasons beyond my control, SingleFile injects a (very small) script when the page loads even if you don't click on the button. It could also send any data to a third party server. Unfortunately, it is therefore impossible for me to technically and formally guarantee that SingleFile cannot behave maliciously. Note however that the extension has the status "recommended" on Firefox and that it undergoes a manual code review by Mozilla at each update.


On Chrome you can go into the extension settings and adjust permissions so SingleFile only has permissions "on click". Then it won't/can't inject that little JS snippet into a page until you actually want to use the extension. The only downside is that after enabling it you then have to refresh the page for the extension to do its work.

I wish this behavior was more well known and encouraged by Google.


You are absolutely right, thanks for the suggestion! I had totally forgotten about this feature.


Could you please elaborate on what script is injected, the reason for it, and why that is out of your control? Thank you.


I will do it, but it will take me some time to explain it and rather than answering on HN I will integrate it in the FAQ. I created an issue for this here: https://github.com/gildas-lormeau/SingleFile/issues/885.


In Firefox you could run a totally different profile.

I don't do this myself; I try to research any extension I add and don't do automatic upgrades. I use as few extensions as possible.


If you care about security, consider using Qubes OS with hardware-virtualized VMs for compartmentalization. Then your Firefox for banking won't have the same extensions you use elsewhere. Works for me.


How old is that demo gif? I just tried reproducing the normal saving shortcomings, and the bottom image ("Example of an SVG image with embedded JPEG images") loads just fine from the local folder, so this seems outdated.

That being said, it's a bit weird that this kind of tool is even necessary at all. I would have expected native saving to include CSS background graphics as well, but apparently they don't for some reason, so I think this is pretty useful. Until now, I have also used pandoc (--standalone) to merge all resources into a single HTML file which worked great.


The demo is approximately 2 years old. Things probably changed meanwhile.


We really, really need Web Bundles to progress and fix these problems correctly, once and for all. There are a lot of things that a tool like this can never get right, and the rest is complicated work that should never need to be done if we have a standard multi-file bundle format.

https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle...


I was hoping this tool also solved a problem that comes from saving & reproducing JS-framework-heavy websites.

Here's the bug: according to the HTML spec, elements like <h2> and <div> cannot be inside <a> tags. But using JS you _can_ push <div>s inside <a>s. (It happens with document.insert-type functions; frameworks like Angular/React allow this.)

Look at nasa.gov, there's html:

  <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"><div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
  <!---->    <h2 class="headline"> ...
    </h2>
  </div>
  </a>
After running this through SingleFile you can visually see the changes, but the html changes are:

  <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"></a>
  <div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
  <h2 class="headline"> ...</h2>
The way that sites like Wayback Machine handle this is by using the web-replay library Wombat https://github.com/webrecorder/wombat that also uses JS to insert those elements.

But what the hell! I was working on a similar HTML-downloading/reproducing tool and this bug really bothers me. I'd like either the HTML parsing standard to be updated to accept <div> inside of <a>, or for that to be made impossible to do via JS.


I think this issue could be circumvented by manipulating the page (replacing images, frames, css etc.) in the tab itself (SingleFile does it in background with a DOMParser instance). The trick is to avoid HTML parsing.


The list of problems that Manifest V3 causes are just more reasons to never use Chrome.


> For security reasons, you cannot save pages hosted on https://chrome.google.com, https://addons.mozilla.org and some other Mozilla domains.

Interesting. What is it about those pages that makes saving them raise security issues?


That is not an extension issue; it's a Google/Mozilla policy thing.


Maybe because JS files (specifically add-ons) run from the local filesystem are given escalated privileges compared to normal usage, perhaps for ease of development. I'm just speculating, though.


I think it’s a limitation on all extensions applied by Chrome/Firefox. My guess is to stop extensions from making you force install more extensions or something...

(Also what’s up Andrew! YC S09 represent :wave_emoji:)


Does this simply remove the JavaScript or do something more clever? Because I think in the age of SPAs, the proper way to save "content pages" might be to execute the JavaScript once and serialize the resulting DOM back to HTML. I didn't find anything in the FAQ that explains if it does something like that.


It saves what you see (and removes JS by default). There is an option to embed the JS and another one to save the "raw" page, but I would not say it is reliable. The cleverness lies more in the ability to produce light pages.


That's a nice and simple tool, good work. I'm personally using Zotero to save copies of web pages: https://www.zotero.org/. With the browser extension you can save a snapshot in a few seconds.


Zotero is actually using SingleFile under the hood to save web pages ;)


Oh, that’s nice :)


Similar project -> https://github.com/Y2Z/monolith

(I used both and ended up favoring monolith, but can’t remember why. I think they’re pretty comparable/am grateful for both of them)


I love monolith too, especially because it is so easy to modify. Only 2300 sloc!

Sure, it uses libraries to do the heavy lifting, but these are all popular, well-tested libraries with well-scoped feature sets (html5ever for parsing HTML, url for parsing urls, etc).

If you're looking for a tool like this but think you might need to tweak it, you should give monolith a try.


How do I toggle reader mode/readability? It doesn't seem to be able to save pages when I toggle Chromium's reader mode on.

I followed the other advice on this thread. In the options:

- Annotation editor > default mode > format the page

- Annotation editor > annotate the page before saving

It automatically formats the page into reader mode, then I can click the "Save the page" icon to save it. But sometimes I want to download the page as-is, like this thread for example. The "Restore all removed elements" button doesn't seem to revert the changes.

For now I just set the default mode to normal, enable "annotate the page before saving", and then click "Format the page for better readability" when needed.


You could also create 2 separate profiles in the options page. One profile would open the annotation editor and the other would not. Then, you would just have to save pages with the appropriate profile. To restore the page, you should click on the "format page" icon.


Thank you. I'll give multiple profiles a try. As mentioned in some of the comments, it's a good tool for managing bookmarks. ArchiveBox supposedly does something similar, but I couldn't make it work.


Thank you for the feedback! Regarding the bookmark management, it could certainly be better. I would have to find some time to code a bookmark manager extension based on SingleFile maybe.


Thank you for the extension. It works well enough for my bookmarks. Just open the bookmarks on my browser and use the "Save all tabs" feature.

I think I'm not the only one who wants an alternative to pocket. A bookmark manager that can archive the links to prevent linkrot.


The most impressive part of the demo is seeing how tidy his Downloads folder is!


Nice project! This project, and a similar project called Monolith [0], were a bit of an inspiration for making my own single-HTML-file tool called Humble [1], to solve a few edge cases I was having with bundling pages (and because I wanted a TypeScript API for making page bundles).

[0] https://github.com/Y2Z/monolith

[1] https://github.com/assemblylanguage/humble



I'm building a tool for people to have a personal archive of their digital life, so that 30 years from now they can revisit content they enjoyed in their younger years.

https://github.com/sergiotapia/ekeko

This is awesome! I would love to integrate this somehow into my project to "singlefile" bookmarks as people make them.

@gildas do you have any recommendation on how to approach this with your extension? Could I run a headless chrome and trigger this extension?


I confirm that you could use a headless browser for this. This is actually what SingleFile CLI does [1]. Here is an example of JS code showing how to configure and inject SingleFile with puppeteer [2].

[1] https://github.com/gildas-lormeau/SingleFile/tree/master/cli

[2] https://github.com/gildas-lormeau/SingleFile/blob/master/cli...
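For a quick smoke test, the CLI can also be run directly from a terminal, along these lines (see the CLI README for the exact options; the flag below is an example):

  single-file --browser-executable-path=/usr/bin/chromium-browser \
    https://example.com example.html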


Thank you!


Love this. Use it all the time. Handy for saving huge pages with all the styling intact for reading offline (like on a plane). You could save a webpage as a PDF, but I prefer this over a PDF.


Is this still on track to become a standard? https://github.com/WICG/webpackage


Even after reading the praise here I wasn’t prepared for how good and useful this extension is. It’s a perfect solution for saving local copies of web pages. I do this frequently, and am surprised I didn’t know about this until today. Even the way it handles settings for the extension is great, with good, built-in documentation. The ability to add annotations is icing, and since they become part of the HTML there is no lock-in or special file format needed.


This is good for people who don't have constant internet access and need to reference web resources offline.

Webpage saving technology does not seem to have kept pace with the evolution of the web.

Images loaded by CSS aren't saved at all. JavaScript on the page will often hijack a saved page and not let it display at all.

One option that works fairly well and does not require installing a browser extension is to save the page as a PDF.

I wish browser developers would put more effort in this area.


This is what 10 year old me thought "Save As" in IE would do, but soon realized the harsh reality of "that's not how any of this works".


Does it have the option to automatically save every page you navigate to? There were some extensions back in the 2000s ("slogger" I think was one, "shelve" or something similar was another) but I don't think they work any more. The pages I think to save now are never the ones I want to look at 5 years down the road.


It does, although I doubt you would want to do this, it's rather slow.


If you keep the javascript, you also get the world's most portable (desktop) application format...


Opening the repo makes you download a 17MB gif. I hope you're not on an expensive mobile connection.

p.s. the demo is nice


Why does this need to:

- Read and change all your data on all websites

- Modify data you copy and paste

- Manage your downloads

Is there a way to use a version that requires fewer of these permissions? E.g. it seems we can address the first permission by only activating it on click, but I'm not sure if that addresses the other ones.


I try to use optional permissions as much as I can. The first permission is required because of assets and frames stored on third-party servers. The second permission should be optional, I don't remember why it's not. I'll try to see if I can make it optional. The last permission is required in order to save the page on the filesystem with the "downloads" API. Note that even if I make these permissions optional, you might still have to trust me anyway ;)


Relevant 'awesome' list for web archiving: https://github.com/iipc/awesome-web-archiving

There are many similar tools there, from archiving to rendering.


In the olden days, Internet Explorer used to allow you to do this by saving the page to an MHT file. It would be a single archive with the HTML, images, etc. embedded.

New browsers don't seem to do this; they create a separate folder for the assets, which is super annoying.


Chromium-based Edge can produce .mht files as well.


Dang it, he beat me to it! I have been toying with the idea for quite some time, but this implementation is great, better than mine would have been, so I'm glad he did it.

Maybe I'll make a CLI implementation (sorta like wget but with this tacked on...)


This is great for a page. I'd love to see it expanded to include an entire site.


Chrome can save to a single file (.mhtml). I am not sure I understand the difference.


The difference is the output format. I created SingleFile before Chrome supported MHTML files. At that time, to save web pages in a single file, the only technical solution in Chrome was to implement something like SingleFile. The advantage of HTML is that this format is much more durable though.


That makes sense. I also used Data URIs to generate distributable web pages.


Yes, there is .mhtml, but its execution plainly sucks because it doesn't exactly save everything. It attempts to save, but it won't be valiant at it; it's like using mhtml without a "force (-f)" argument.


Great stuff!

For some reason, I went in expecting to see a JS-enabled multi-page web site turned into an SPA in a single HTML file, but I didn't expect to see images get embedded.

Perhaps offer a recursive traversal option too, but don't try that on Wikipedia :)


Back in the day this was always one thing that had me begrudgingly and shamefully opening IE so I could save a page as an MHT file. So long ago now. Cool to see this idea has been revived and not in a proprietary way


I love SingleFile and have been using it for years! Is there any version that works on current mobile browser versions? I've stuck with an old version of Firefox on Android that still supports the extension.


You should be able to use it on Firefox for Android Nightly (which is very stable) by following this procedure: https://blog.mozilla.org/addons/2020/09/29/expanded-extensio...



Thanks for this. I expected to see a pricing link somewhere, having been attuned to all the subscription SaaS these days. Glad to see there are still tools offering immense value for free.


It is in fact more or less self-financed by... hmmm... a SaaS that I market but it's in B2B.


I use SavePageWE; it can save the page (into a single file) as it was modified by JS after load, which is often useful.

The only thing I miss: I wish it were easier to script.


I have been using WebScrapBook (an add-on for Firefox) for some time. I really like it. Does anyone else have experience with this add-on? Good or bad?


I've been using it for a couple years (2 maybe) and I like it quite a bit as a quick and easy way to save pages. ArchiveBox looks fantastic, but I just don't have the motivation to set up the service and maintain it since I don't save enough links to make it worthwhile. SingleFile might be worth a shot, but it looks like WebScrapBook has been handling your needs just fine (they seem to have 90% of the same functionality).


Similar approaches were proposed at https://github.com/wabarc/wayback


Thanks!

ArchiveBox does indeed look fantastic. Their homepage alone is beautiful.

I bookmarked both ArchiveBox and now also SingleFile, but WebScrapBook gets the job done (in almost all cases).


As a WebScrapBook user, do you know if there is a migration path from Pocket or another hosted service?


Don't know about a migration option, but I do remember there's a lot of custom configuration possible.


I've been using this since Martin posted about it on Ghacks. Love using it and thank you gildas.


Thank you! I've been looking for this for a while, nice to see someone finally did it!


Microsoft had something called MHTML that did this about 20 years ago ... Tablet PC era.


Naming a thing takes creativity and luck. Congratulations on an excellent name!


Does it create an inline data URL for each image even if they're the same?


Most of the time, it will be able to deduplicate them. For example, if you save this page https://groups.google.com/a/chromium.org/g/chromium-extensio..., each distinct avatar will be embedded only once. To achieve this, SingleFile stores the content of duplicate images in CSS custom properties, displays them as background images in the IMG tags, and uses a (properly sized) transparent SVG image as the SRC. Thus, stylesheets are not broken.
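Roughly, the saved markup ends up looking like this (names simplified; the real output differs):

  <style>
    :root { --img-0: url("data:image/jpeg;base64,/9j/4AAQ..."); }
    .img-0 { background-image: var(--img-0); background-size: 100% 100%; }
  </style>
  <!-- every duplicate points at a tiny transparent SVG of the right size -->
  <img class="img-0" src="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' width='32' height='32'/>">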


Nice, that definitely saves some space


In fact, SingleFile implements several ways to save space. In practice, the most effective mechanism is shaking the CSS tree.

It's amazing how much CSS is useless in a page. It's especially annoying for SingleFile if it contains images... That's why SingleFile removes (almost) all unused rules, selectors and CSS properties by calculating the CSS cascade.
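A toy version of the idea, to make it concrete (the real implementation is far more thorough: it computes the cascade and handles frames, media queries, pseudo-classes, etc.):

  // Drop style rules whose selectors match nothing in the document.
  function dropUnusedRules(sheet) {
    for (let i = sheet.cssRules.length - 1; i >= 0; i--) {
      const rule = sheet.cssRules[i];
      if (rule.type !== CSSRule.STYLE_RULE) continue;
      try {
        if (!document.querySelector(rule.selectorText)) {
          sheet.deleteRule(i); // nothing matches: dead weight in the snapshot
        }
      } catch (e) {
        // unparseable (e.g. vendor-prefixed) selector: keep the rule
      }
    }
  }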


This would be very useful in many situations, and a great demo!


I'd also recommend "Print Edit WE" [1] and "Save Page WE" [2] for Chrome-type browsers, both by the same author. The first one allows for editing the page before printing/saving (as a single-page HTML or MHTML); the second one allows for single-page saves.

[1] https://chrome.google.com/webstore/detail/print-edit-we/olnb...

[2] https://chrome.google.com/webstore/detail/save-page-we/dhhpe...


If it’s a single file, then how do the images get stored?


Images are stored as data URIs [1]. Note that they could also be stored as entries in a zip file! [2]

[1] https://en.wikipedia.org/wiki/Data_URI_scheme

[2] https://github.com/gildas-lormeau/SingleFileZ
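Concretely, a saved IMG tag looks like this (payload truncated here):

  <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">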


They're base64 encoded[0]. (This is an approach I myself have used in the past for simplifying the archival of regulatory texts.)

[0] https://github.com/gildas-lormeau/SingleFile/blob/15801c8ef4...


You read my mind, I was exactly looking for that!


Using it to export Logseq pages; works perfectly.


wget -r url ?


Ah, millennials invented .mht


Iran has a habit of using tools like this to trick defense contractors into using their page.



