SingleFile: Save a complete web page into a single HTML file (github.com/gildas-lormeau)
958 points by crbelaus on March 2, 2022 | 240 comments



Author here, it makes me really happy to see SingleFile on the front page of HN. Thank you! I'll take this opportunity to make you aware of the upcoming impact of Manifest V3 [1], and for those who prefer zip files, I recommend having a look here [2].

[1] https://github.com/gildas-lormeau/SingleFile-Lite

[2] https://github.com/gildas-lormeau/SingleFileZ


Thanks for this project. I found SingleFile a year or two ago and used it to take "HTML screenshots" of third-party sites that I could embed in guided walkthroughs, with real data swapped out for modified/example data, instead of just PNGs.

SingleFile was ultra-valuable for this.

If anyone has a similar use-case, I wrote some pretty rough (and slow) code to post-process SingleFile's output to remove any HTML that wasn't contributing to the presentational render by launching puppeteer and comparing pixels. It's available here: https://github.com/mieko/trailcap


It's interesting! I had started something similar as part of testing but hadn't really finished my work. I will have a look at your project.


One very useful thing you could add to this (if you feel like it) would be to make it work with snapshotted directories, rather than a single HTML file with inlined data. You can get the former with SingleFileZ and then extracting the resulting zip file.

I like these because it makes it easier for me to make manual edits when necessary and it's a better solution for long term archiving (IMO). But I would love to add your project to my workflow.


Oh, that’s such a good idea for documentation - thanks for sharing!


SingleFile is one of my favorite addons since it allows me to keep offline copies of articles, tutorials, etc. I see online without losing images and the like (a ton of articles have been lost over the years, and while some are preserved in archive.org, they often lack things like images, so I prefer to save anything I come across). So thank you for making it :-).

Now, having said that, the text in SingleFile-Lite's "Notable features of SingleFile Lite" sounds like a list of issues :-P. It looks like these are issues with Chrome, but do you know if/how these "improvements" will affect Firefox?


AFAIK, for the moment Mozilla is aware of the regressions that Manifest V3 causes and is showing good will in trying to reduce them as much as possible. You can find some information about this here: https://github.com/w3c/webextensions/tree/main/_minutes


I use ArchiveBox (Django-based) for keeping offline copies (it uses SingleFile and a few other libs in the backend): https://github.com/ArchiveBox/


You might be interested in https://github.com/wabarc/wayback, a chat-based archiving tool.


Thank you for the Manifest V3 critique, the examples you give make it really clear how many things are regressing with this upcoming change :/


FYI: Figure tags don't convert their hrefs to base64.

For example, try saving my home page: https://andrewrondeau.herokuapp.com/

The img tags are converted correctly, but there's still <figure class=image><a href="https://andrewrondeau.herokuapp.com/... in the single HTML file.


I cannot reproduce your issue, I just did a test on this page and I see the expected `<img src="data:image/jpeg;base64,...` in the saved page.


Look for figure tags, not image tags

Even better, just search for ="http


All I can find is a <a> tag.

Edit: I think I now understand the issue. I confirm SingleFile saves only the current page and not linked images, for example.


I've been using SingleFile for the last year or so, it's amazing!

I'm going to hijack your post for a question! I love the way you can use the editor and select "format for better readability," then save just the stripped down version of the page. I use this to send it to my e-ink device.

The question I have is whether it's possible to toggle the default save to use the formatted version automatically? I dug into the options and didn't turn anything up!


You can enable these options for this:

- Annotation editor > default mode > edit the page

- Annotation editor > annotate the page before saving


Sorry, I was wrong, you have to select "format the page" instead of "edit the page" (first item).


Thank you for your work! I've been using start.me for my new tab page (it's the page you see every single time you open a tab, I can't believe most people don't make it useful), but it's way too slow so I SingleFile it and have a local Firefox extension to set it as my new tab page.

It's too complicated but at least it works!


Is it possible to use this within the context of the current web page, without the extension portion?

Taking a snapshot of my user's screen and then displaying it to them later (maybe in an iFrame)?


It's possible, but it's a bit limited. For example, it won't be able to save images coming from a different origin.
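A rough sketch of the in-page approach (a hypothetical helper, not SingleFile's actual code):

  // Inline same-origin images as data: URIs, then serialize the DOM.
  // fetch() of cross-origin images fails unless the server sends CORS headers,
  // which is exactly the limitation mentioned above.
  async function snapshotPage(doc = document) {
    for (const img of doc.querySelectorAll("img[src]")) {
      try {
        const blob = await (await fetch(img.src)).blob();
        img.src = await new Promise(resolve => {
          const reader = new FileReader();
          reader.onload = () => resolve(reader.result); // "data:image/...;base64,..."
          reader.readAsDataURL(blob);
        });
      } catch (e) {
        // e.g. a CORS failure: the image keeps its remote URL
      }
    }
    return "<!doctype html>" + doc.documentElement.outerHTML;
  }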


Thanks for the work you have done, it's a lazy man's heaven, especially for bulk downloads, and it helped me a lot. About a month ago I decided to back up my bookmarks via ArchiveBox; it was more than 1k bookmarks, and the most reliable methods were SingleFile and wget.


Well done, gildas! You have really built a quality product and marketed your work well.


I love it and use it. But product? Nothing is being sold and product sounds almost condescending. :)


> But product? Nothing is being sold and product sounds almost condescending. :)

Really!? You don't think it's condescending to try to find criticism in someone's praise, haha? Heh, anyway :) You feel product is bad!? So weird!! I guess you find there what you bring to it. Wonder what you're protecting there; if you share more of your thinking, we can get to know you more. Even so, I think we can just celebrate gildas' achievement! :)


Thank you Cris!


A twelve-year project with nearly 7000 commits shows a lot of dedication. Good work.


If I start using SingleFile today, will I still be able to open saved pages after the update to Manifest V3?

I mean, if I want to save pages over the next 11 months, should I install SingleFile or SingleFile Lite?


In fact, you simply do not need an extension to open pages saved with SingleFile (or SingleFile Lite) because they are standard HTML pages. So you don't have to worry about that.


This alone is fantastic. I've been looking for an mhtml replacement that worked well across all browsers.


Is there a configuration for the zip version where I can avoid duplicating the static assets? Thanks


I guess you're referring to SingleFileZ. This option is not needed because zip files (i.e. what SingleFileZ produces) already provide this feature.


Very nice! Will use it for sure. May I ask you how you created that good looking demo gif?


I used:

- ScreenToGif to record video sequences and produce the final GIF: https://www.screentogif.com/

- Macro Recorder to record and replay user navigation: https://www.macrorecorder.com/

- Blender to edit the video, add text comments, and make the intro: https://www.blender.org/


Thank you for sharing!


Thank you, very useful and works like a charm: a must have.


FYI - there’s an official standard (MHTML) for doing this that has existed for 20+ years and exists natively in browsers.

https://en.m.wikipedia.org/wiki/MHTML


> FYI

The alternative format (used by the Internet Archive and Wayback Machine) is WARC. It's also a single file, but it preserves the HTTP headers as well, so its application is specifically archival. [1] The "wget" tool, which is co-maintained by the Web Archive people, also has support for it via CLI flags. [2]

Though when it comes to mobile browser support, I'd recommend using MHTML, because WebKit and Chromium both have support for it upstream.

[1] http://iipc.github.io/warc-specifications/

[2] https://www.gnu.org/software/wget/wget.html


WARC is also used by the Webrecorder project. They made an app called Wabac which does entirely client-side WARC or HAR replays using service workers and it seems to have pretty good browser support, but I haven't really dug into the specifics.

https://github.com/webrecorder/wabac.js-1.0


There is a project that uses a headless browser to implement HAR.

https://github.com/wabarc/screenshot


Is there any objection to adding WARC support to webkit/chromium? Seems like a not-so-complex project...


I know that WebKit relies on either libsoup [1] (on Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?)) as a network adapter, so the header handling and parsing mechanisms would have to be implemented in there.

Though, on MacOS, WebKit tries to migrate most APIs to the Core Foundation Framework, which makes it kind of impossible to implement as a non-Apple-employee because it's basically a dump-it-and-never-care Open Source approach. [3]

Don't know about chromium (my knowledge is ~2012ish about their architecture, and pre-Blink).

[1] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...

[2] https://github.com/WebKit/WebKit/tree/main/Source/WebKit/Net...

[3] https://github.com/opensource-apple/CF


GTK/WPE use libsoup. PlayStation/Windows use curl. And yes, Apple's networking is proprietary.


I wasn't sure about WPE with regard to libsoup, due to the glib dependencies and all the InjectedBundle hacks that I thought they wanted to avoid.

I mean, in principle curl would run on the other platforms too... but as far as I can tell there's an initiative to move as much as possible to the CF framework (strings, memory allocation, HTTPS and TLS, sockets, etc.) and away from the cross-platform implementations.


Over a decade ago I had a laptop but no internet at home. This was one of the ways I taught myself programming (and also downloaded dozens of manga): using Internet Explorer at a cafe, which had an option to save to MHTML, a single file with everything self-contained. I legitimately owe a portion of my success to this. I still have some of these files, old crusty hello-world C++ tutorials, etc.


I have fantastic internet, and I still do something similar. Local docs just load so much faster, and if something happens (which it still does, even on Fiber in the US), I have docs and can program.

Lemme see if I can pull up the command I use to mirror doc sites.

    wget \
      --recursive \
      --level=5 \
      --convert-links \
      --page-requisites \
      --wait=1 \
      --random-wait \
      --timestamping \
      --no-parent \
      $1
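If you save that as a script, say mirror-docs.sh (a name I just made up), usage is just:

  ./mirror-docs.sh https://docs.djangoproject.com/en/4.0/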


For people who cannot afford internet access now, and for perhaps more in the future if times get more difficult, I believe this is a very important use-case.


The Chrome engineer who maintains the MHTML work wrote up a comprehensive doc on the modifications to the MHTML spec (RFC 2557) that are implemented: https://docs.google.com/document/d/1FvmYUC0S0BkdkR7wZsg0hLdK... Might be useful for you, gildas.


Thank you Paul! I had read this document some time ago, especially to see how the shadow DOM was serialized.


The browser compatibility section suggests MHTML is unsupported in current versions of Firefox and Safari.


I don't think it was ever native in Firefox, there is/was the excellent unMHT extension that was broken by Quantum/WebExtensions and The Great XUL Silliness. Shame.

I have Waterfox-Classic and unMHT (fished out of the Classic Addons Archive, just remember to turn off Waterfox's multiprocess feature) since I occasionally need to archive web pages - and more importantly, reopen them later.

mhtml is just MIME, literally every discrete URL as a MIME part with its origin in a Content-Location header, all wrapped in a multipart container. I don't understand why it's not a default format.
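For illustration, a hand-written skeleton of what an .mht file contains (simplified; real browser output differs in the details):

  MIME-Version: 1.0
  Content-Type: multipart/related; type="text/html"; boundary="PART"

  --PART
  Content-Type: text/html; charset="utf-8"
  Content-Location: https://example.com/

  <html>...the page markup...</html>

  --PART
  Content-Type: image/png
  Content-Transfer-Encoding: base64
  Content-Location: https://example.com/logo.png

  iVBORw0KGgo... (base64 image bytes)

  --PART--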


I can see WebExtensions breaking it (as it's a completely new set of APIs for extensions, and the losses do definitely still hurt)... but quantum/xul? How is that related, aside from "it happened around the same time"?


IANA Firefox dev: XUL/XPCOM = old APIs, WebExtensions = new (multi-browser) API.

Quantum was the project name for re-engineering the Firefox internals, with lots of design changes, not just plugins. The XUL/XPCOM APIs were dropped; as an occasional programmer I understand why, and "Quantum broke my plugins" is a reasonable first approximation for most users.


Safari supports webarchive, which does basically the same thing


The problem is that it is a proprietary format. The advantage of the format produced by SingleFile (HTML) is that as long as your browser is capable of interpreting HTML, you will be able to read your archives without worries.


Not so proprietary. It's really just a plist file, whose format is known and even open-sourced by Apple [1]. Really, it's only proprietary in that no other platforms have implemented it.

[1]: https://opensource.apple.com/source/CF/CF-550/CFBinaryPList....


For anyone else that didn't read the README, MHTML is mentioned in the comparison section https://github.com/gildas-lormeau/SingleFile#file-format-com...


Take the comparison with a grain of salt. Not including WARC is like excluding water from a comparison of beverages, it is the baseline standard.


> MHTML, (...) is a web page archive format used to combine, in a single computer file, the HTML code and its companion resources (such as images, Flash animations, Java applets, (...)

Well that goes to show its longevity I guess.


Does anyone else get two security warnings whenever you try to save an MHTML page using a Chrome extension? I have to click one warning's button to confirm that I indeed want to save the "dangerous" file and another to confirm I'm really sure. It's gotten very annoying. I've looked all over for an option to disable this behavior but haven't been able to find one.


I've looked into this extensively, as I can't find a good, light, and easy backup option that isn't extreme overkill.

I thought MHTML was NOT standardized which is why it wasn’t across all browsers yet. From what I remember, every company was doing their own implementation of it. Maybe it’s gotten more standardized the last few years though.


I've always thought the "M" stood for "Microsoft" -- wasn't even aware any browsers other than IE supported it.


There is also CHM which is actually a Microsoft only file format for "Compiled HTML Help" files.


I love this format. Very fast and compact. The entire Visual Studio help was in it once. Worked VERY well. And there's a KDE/Qt reader.


And it generally does not do a good job


What are the issues?


The big one in my experience is that it doesn't play well at all with JavaScript. SingleFile, to my knowledge (I experimented with it briefly), allows all JS to load on the page and can then embed the loaded media as base64. I think it also has heuristics to embed relevant JS as well. It still only gets you 90% of the way there, and I came to the conclusion that unless you are doing web-archive-type work or need audio/video, a composite image works well.


From my experience: wrong layout, missing pictures.


Unfortunately mhtml is not widely supported.


I remember saving webpages in MHTML when I was using dial-up so that I could read them offline later.

I would also download entire websites using a piece of software whose name I forgot, to read them offline. Back when websites fit on a single floppy disk.

Good times!


I remember using HTTrack for this a while back. Still have a few of those sites lying around, I think.


I was gonna say Opera (the old, good one) had this. When saving a page there were some options and one was a single file IIRC.


I use this Chrome extension to save web pages as MHTML: https://chrome.google.com/webstore/detail/save-webpages-offl...


IIRC, back in the day MHTML wouldn't save Java applets.


Are any sites still using applets these days?


80% of server IPMI web control panels. But who would want to save those anyway? :)


A lot of those are getting HTML5/Canvas based implementations and most of the old AST BMCs can get it through upgraded firmware.


None of my machines had any such upgrades and never will :(


Can we please stop with the 17MB GIF images used as demos? They use up lots of data immediately as you open the page, and they're impractical: you don't know how long the animation is, you can't fast-forward/rewind, and you can't go fullscreen on mobile.

And GitHub supports embedded videos in README.md files; videos are generally smaller than GIF files, and their disabled autoplay is a feature: you save your data until you press play.


> GitHub supports embedded videos in README.md files

True since May 2021 so I think a lot of people are still finding this out...

In my experience GIF is still the most set-it-and-forget-it way to know a video will play; to get cross-platform support out of mp4 you may have to provide two different codecs. Anyway, not disagreeing with you, and most GIFs could drop 90% of their size with a better choice of resolution and framerate. This readme is particularly egregious, doing a screen capture with scrolling.

As for saving bandwidth until you want to play, I haven't tried this yet but it seems adequately clever to wrap a loading=lazy gif inside a details/summary tag: https://css-tricks.com/pause-gif-details-summary/
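i.e. something along these lines (adapted from the article; the filename is a placeholder):

  <details>
    <summary>Play the demo (17 MB GIF)</summary>
    <img src="demo.gif" loading="lazy" alt="SingleFile demo">
  </details>

Since the image starts out unrendered inside a closed <details>, the lazy loader shouldn't fetch it until the user expands it.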


> to get cross-platform support out of mp4 you may have to provide two different codecs

Video codecs are not my area of expertise. Which codecs are these and what tool(s) would you typically use to ensure you provide them?


See this gist and the code comments [0] - basically you just need to know the magic flags to pass to ffmpeg, transcoding the file with all the right settings.

[0] https://gist.github.com/ingramchen/e2af352bf8b40bb88890fba4f...
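If it helps, the two encodes usually look something like this (the flag choices are my own defaults, not from the gist):

  # H.264/MP4: yuv420p and faststart are the compatibility-critical flags
  ffmpeg -i demo.mov -c:v libx264 -crf 23 -pix_fmt yuv420p \
    -movflags +faststart -an demo-h264.mp4

  # VP9/WebM as the second codec, for browsers that prefer it
  ffmpeg -i demo.mov -c:v libvpx-vp9 -crf 35 -b:v 0 -an demo-vp9.webm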


Not to mention that H264 can take quite a bit of horsepower to decode and play as well (assuming your machine doesn't have a hardware chip specifically for doing just that).


Which machine doesn't? Anything in the last 10 or so years will decode H264 with much less power than GIF because of it. Even a Pi supports it.


My 2014 ThinkPad X1 Carbon (gen 3) doesn't have hardware transcoding as far as I can tell, which made Zoom and Discord impossible to use for class, especially because there was no way (that I knew of) to disable all video except the presenter's. Even playing a YouTube video on it makes it ramp up.


I'm not sure which CPU you have specifically but the lowest-end model of the X1 Carbon Gen3 has an i5-5200U [1] that lists Intel Quick Sync Video support.

From the wiki page for Quick Sync [2]:

> Intel Quick Sync Video is Intel's brand for its dedicated video encoding and decoding hardware core. Quick Sync was introduced with the Sandy Bridge CPU microarchitecture on 9 January 2011 and has been found on the die of Intel CPUs ever since.

I can't confirm but I'd guess your performance issues lie elsewhere than in the h264 decoding specifically.

[1] - https://ark.intel.com/content/www/us/en/ark/products/85212/i...

[2] - https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video


If you check out the generation-codec table in that Wikipedia article [1], under Broadwell (I believe that's the 5200U's generation name), it says there is support for AVC (which I believe is H264; I'm not a codec wiz), so that's a really good point. I'm not sure why I've consistently had issues with this on my machine, then. I wonder if it's something with the configuration on Linux?

Thanks for pointing that out. I've looked at this table before and paid attention to HEVC, not AVC, so I believe that's where my mistake came from.

[1] https://en.wikipedia.org/wiki/Intel_Quick_Sync_Video#Hardwar...


AVC is H.264, yes.

Accelerated video decode is often disabled by default on Linux versions of browsers and can be quite dependent on versions of drivers/mesa/X-vs-Wayland/etc.


YouTube by default prefers newer, bitrate-saving codecs over old ones if it thinks your CPU can handle software-decoding them. On my 2017 Dell XPS, 1080p and lower resolutions on YouTube play in software-decoded AV1, and only 1440p and higher play in hardware-decoded VP9, so playing 4K video on YouTube is less taxing for my CPU than playing a 1080p video...


You can use the h264ify extension to fix it.


The problem is Zoom and Discord are doing multiple streams. But it really shouldn't be a problem.

H.264, even on the High profile, is not CPU-intensive on a 2014 machine, unless you are watching 1080p at 5-10 Mbps, which is not the norm for internet video.


Is this really still an issue in 2022? How many people are browsing the internet on a device that can't do hardware H264 decoding?


Some browsers have poor hardware decoding support on Linux (their problem, not the drivers'), but it's gotten a lot better recently.


Author here, sorry for the GIF file. I created it because people were not happy with the video hosted on YouTube. AFAIK, embedded video files did not work on GitHub when I made this demo. I'll try to improve this in the future.


I wanted to comment on how useful that demo was to me. It did a great job at demonstrating why this is useful and how well it works compared to the native browser implementation. Thank you both for the demo and for the project!


Thanks :)


Note that you can trim about 1/3 of that video by using an online GIF compressor such as https://www.freeconvert.com/gif-compressor (no affiliation).


Would you mind sharing which tools you used to create that demo? It is really well done.



GitHub only recently expanded video support from gif to decent video formats, and many github enterprise installs don't have those new features yet. So, keep spreading the word.


I think there is some nuance here.

If the demo sequence is <5 seconds, I have never found myself becoming impatient. Gif is perfect for very brief demos. Anything longer than that and I'd like to have some idea where I am at in the video stream (and other controls as indicated)


Just the "intro" splash takes more than 5 seconds, and it's totally unnecessary

Yes, the gif bothered me too :D


It would seem that this is very subjective. However, I admit that I let myself go a bit on the intro, and I can understand your point of view.


I wish browsers came standard, preconfigured with warning dialogs that triggered if assets attempting to load were beyond some threshold. That threshold could be decided by the browser vendors group based on some collection of network statistics and be adjusted on an annual basis or so.


> And GitHub supports embedded videos in README.md files

Any documentation on this? Because I have tried to embed video in issues and PRs before, and did not manage. I'm hoping such documentation will explain how this extends to issues and PRs.


In issues it's just drag-n-drop.


Giving a massive upvote for this, disappointed and confused to see you've been downvoted here. There's literally no reason to use GIFs like this, and - as you stated, it's massively disrespectful to those not fortunate enough to have broadband connections, but would like access to the information.

Using data so wastefully like this always reeks of privilege to me - especially on something like GitHub. Wikipedia, for instance, never allows things like this.


> disappointed and confused to see you've been downvoted here

Because it's a relatively new feature, and probably, a lot of devs don't know about it (I didn't).

I did this [animated gif] once actually, before the feature was introduced, and I definitely hated it, but I had no choice.

Thanks for bringing this to the general attention, though :)


Sharing a project with the world and taking time to document it reeks of privilege? I really can't understand your reasoning.



The issue is mainly with mobile browsers, as mobile data is expensive... and Firefox on iPhone doesn't have about:config.


FYI, I updated the README page. Now it includes an mp4 video weighing ~9MB.


I also want to give praise about the demo. It's one of the best demos I've ever seen with such a project. Nice job!


A 16MB gif with no playback controls, so you had to go through the tedium.


I would be surprised that the author isn't using WebM to get a smaller file size (not to mention higher quality), but the project itself leads me to believe that the author has a lot of free disk space to use.


There's no need to make further assumptions about the author (who, btw, took the time to build a very useful tool and share it on the Internet for free). Just point out the issue with the GIF and move along.


I think you missed something here; parent wasn't insulting the author. They were inferring that someone who made an archiving tool probably has resources that help with archiving (bandwidth means access to documents and storage space means a place to save them-- you get the idea).


I never made an assumption about the author and certainly never said that the tool wasn't useful. You can feel free to move along yourself, though.


I love, love this extension. I am working on an app to turn it into a single-click bookmark system on Linux: run an inotify service to watch your downloads, then process any SingleFile downloads into a database and update a browsable index.


You might like https://archivebox.io/, I think it can do this for you and then some.


TELL ME MORE.

I think I basically get the idea, what kind of database are you using? Recoll sounds like a good idea, but I'm also thinking about how I might also make this public-ish.

(i.e. I teach in college and would love to have a centralized way to store and search all my assigned readings, which are most often webpages)


I am not a trained software engineer but...

Each HTML page is processed by (1) getting the URL, title, and time saved (this is under-rated, as the approximate time of saving is useful if you want to rediscover things), then (2) taking a screenshot, and finally (3) extracting text with readability.js and hopefully doing some keyword analysis.

Right now it is stored in a local SQLite Database, although the article content is stored in text files. For search, I can use ripgrep to look through the associated text files.

The eventual goal is to create a Flask app which will allow for interactive management of the bookmarks (tagging, searching). I've already got static generation of bookmarks working.

Here's a screenshot: https://imgur.com/5YP4sP5
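For the curious, the watcher part is tiny; a rough sketch with inotify-tools (the processing script is hypothetical):

  # Watch Downloads for finished writes and hand saved pages to the pipeline
  inotifywait -m -e close_write --format '%w%f' ~/Downloads |
  while read -r file; do
    case "$file" in
      *.html) ./process-bookmark.sh "$file" ;;
    esac
  done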


I archived (privately) some documentation pages from some of our vendors that were behind a login page, using this, just in case they became inaccessible at a critical time for us.


Awesome! Let's bring it up a notch - save every page automatically.

Then add a search engine over that, for 'what was that article about long term effects of DDT on ecosystems I was reading a long time ago?' queries.

And you get a memex - a way to outsource part of the brain to a computer :).


I'm using Recoll for this exact purpose. Just without inotify.


This sounds neat.


WANT


Maybe a little OT, but founders should take a careful look at this landing page. That's how you sell something. The demo is clear about the problem they're trying to solve and it convinced me that their product actually solves it. It's not just all the information they've included, but also the lack of irrelevant clutter.


I scrolled past the gif because I didn't realize that it was an informative gif. The first few seconds looked like just an animated logo, and I never stayed to watch it. It could've just started with an action instead of animating the logo.


you actually sat there and watched a 17MB gif video at 1x speed with no controls? lol. worst landing page i've ever seen


I think this is the worst comment I've ever seen as well! ;)

BTW, it's not a landing page. This is the README on GitHub.


Related: I used to keep a collection of locally mirrored web pages a long time ago, with a legendary Firefox extension called ScrapBook [0] (now long retired). The surprise for me is that after all these years I still remembered the name...

While writing this comment I found that it lived on as a (now "legacy") new extension named ScrapBook X [1], and then yet another one named WebScrapBook [2], which seems to still be alive!

[0]: http://www.xuldev.org/scrapbook/

[1]: https://github.com/danny0838/firefox-scrapbook

[2]: https://addons.mozilla.org/en-US/firefox/addon/webscrapbook/


It should be noted that Manifest V3 will break this extension for Chromium-based browsers.

https://github.com/gildas-lormeau/SingleFile-Lite


Love the list of notable "features". :)


Also this:

> Benefits of the Manifest V3

> - None


What a cool project! I love the way this embeds images. One of things I miss most, though, when going back to old sites, is embedded audio or video. From looking at the options, it seems like it might be able to handle encoding video and/or audio as Data URIs, but it's not totally clear if SingleFile does this or not. I wasn't sure if I was doing the correct things to force this behavior in the options. It would be great if the README could clarify how these are handled by SingleFile. Sometimes it might be nice to be able to embed these sorts of things, even if it does make the HTML ridiculous and bloated. Or, barring that, maybe just a recommendation to use one of the other formats in the comparison table for this kind of use case.


Unfortunately that won't allow you to click links in your offline version. You can do this properly with wget (sorry, I don't know how to do code formatting on Hacker News):

  wget --mirror \
    --convert-links \
    --html-extension \
    --wait=2 \
    -o log \
    https://example.com


Are you suggesting mirroring e.g. the entire Wikipedia through wget?

That is not only suboptimal, it is stressful for the server. At least you added --wait=2, but on any large site/hoster/CDN this might still get your IP banned or throttled. And on e.g. the English Wikipedia this will take 149 days, which means that by the time you hit the last page, the first ones (and their links) are out of date.


If you add '--no-parent' (doesn't request anything above the requested URI that isn't a page dependency) and '--level=5' (only follows links 5 levels deep), you won't get all of a site. That makes it more realistic for getting Wikipedia articles.


You don't need to newline every flag of a trivial command.


I'm guessing the user's intent was to have the command formatted across multiple lines.


Looks like SingleFile helps with sites where you have to be logged in, something that is not that easy with wget.


What are you talking about? I have hundreds of pages saved with SingleFile and I can click links in all of them.


Oh maybe it does work then. I assumed it didn’t follow links because they didn’t show it in the video.


Do you mean converting links so that they point to your local copy of other pages from the same site? Yeah it doesn't do that; it isn't a tool meant for archiving whole websites. It's just for archiving a single page, with all the content baked into it. Like a bookmark but doesn't linkrot.


Code formatting is just indented text.

  So one empty line followed by indented text (2 or more spaces)


Been eyeing this for a long time!

I'm building a bookmark app, and I plan to use this to save bookmarks!

I'm a simple man, nothing too fancy. Here's a crude demo in progress - https://zewallet.netlify.app/ Follow progress here - https://twitter.com/recursiveSwings/status/14917723874649088...

Would love to have ANY tips or feedback!


the signup email confirmation link points to http://localhost:3000/ btw

I'm definitely in the market for a bookmark service that archives my bookmarks; Diigo stopped working a year or two ago, and Pinboard can't stay up.


Zotero deals with this reasonably well, and happens to be using SingleFile under the hood. Its landing page just targets a specific audience (academics), which means upwards of 90% of the people who would happily use it probably end up bouncing after thinking, "This isn't for me", before ever trying it. Give it a shot.


Ahh damn, should fix it! For now, you can edit the URL manually to take a peek. If you're interested, feel free to send a DM on Twitter @recursiveSwings, I'll let you know once it's in Beta! :)


Fixed now!


This is great. I've always wondered why this isn't the default behaviour for page saving in browsers. To an ordinary user saving a page implies saving a single file, not a file plus a directory of stuff. HAR can be useful but seems only for niche or specialised reasons.


I use a HAR file extractor because normally I don't want a single file; I want a replica of the web server's file system structure, including any dynamically loaded assets: https://blog.cetinich.net/content/2022/download-website-and-...


From the warning before installing in Firefox:

    Add SingleFile? This extension will have permission to:
      - Access your data for all websites
      - Input data to the clipboard
      - Extend developer tools to access your data in open tabs
      - Download files and read and modify the browser's download history
      - Access browser tabs
... all of which I don't mind, as long as the extension can't exfiltrate any of the data it can access (send it to a third party), i.e.:

    - no network connections from the extension
    - no modifying web pages or executing any code in the context of a web page
Does anyone know whether these sorts of extensions can exfiltrate data? This is a concern if the project author's credentials are stolen by a threat actor, which has happened before.


Security question: Is a web extension safe if it is installed but if you're not using it at the moment? For example, if I were logged into my bank's website and I did not click the SingleFile button in the extension toolbar, could it still theoretically collect info from my bank's webpage or do other actions?

I'd like to use SingleFile and have no reason at all to distrust it, but I'd like to understand the security impact of installing lots of web extensions. How do people handle security risks like that? Do you run a separate vanilla browser with no extensions for sensitive tasks?


For technical reasons beyond my control, SingleFile injects a (very small) script when the page loads even if you don't click on the button. It could also send any data to a third party server. Unfortunately, it is therefore impossible for me to technically and formally guarantee that SingleFile cannot behave maliciously. Note however that the extension has the status "recommended" on Firefox and that it undergoes a manual code review by Mozilla at each update.


On Chrome you can go into the extension settings and adjust permissions so SingleFile only has permissions "on click". Then it won't/can't inject that little JS snippet into a page until you actually want to use the extension. The only downside is that after enabling it you then have to refresh the page for the extension to do its work.

I wish this behavior was more well known and encouraged by Google.


You are absolutely right, thanks for the suggestion! I had totally forgotten about this feature.


Could you please elaborate on what script is injected, the reason for it, and why that is out of your control? Thank you.


I will do it, but it will take me some time to explain it and rather than answering on HN I will integrate it in the FAQ. I created an issue for this here: https://github.com/gildas-lormeau/SingleFile/issues/885.


In Firefox you could run a totally different profile.

I don't do this myself; I try to research any extension I add and don't do automatic upgrades. I use as few extensions as possible.


If you care about security, consider using Qubes OS with hardware-virtualized VMs for compartmentalization. Then your Firefox for banking won't have the same extensions you use elsewhere. Works for me.


How old is that demo gif? I just tried reproducing the normal saving shortcomings, and the bottom image ("Example of an SVG image with embedded JPEG images") loads just fine from the local folder, so this seems outdated.

That being said, it's a bit weird that this kind of tool is even necessary at all. I would have expected native saving to include CSS background graphics as well, but apparently they don't for some reason, so I think this is pretty useful. Until now, I have also used pandoc (--standalone) to merge all resources into a single HTML file which worked great.


The demo is approximately 2 years old. Things probably changed meanwhile.


We really, really need Web Bundles to progress and fix these problems correctly, once and for all. There are a lot of things that a tool like this can never get right, and the rest is complicated work that should never need to be done if we have a standard multi-file bundle format.

https://wicg.github.io/webpackage/draft-yasskin-wpack-bundle...


I was hoping this tool also solved a problem that comes from saving & reproducing JS-framework-heavy websites.

Here's the bug: according to the HTML spec, elements like <h2> and <div> cannot be inside <a> tags. But using JS you _can_ push <div>s inside <a>s. (It happens with document.insert-type functions; frameworks like Angular/React allow this.)

Look at nasa.gov, there's html:

  <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"><div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
  <!---->    <h2 class="headline"> ...
    </h2>
  </div>
  </a>
After running this through SingleFile you can visually see the changes, but the html changes are:

  <a href="/press-release/nasa-invites-media-to-next-spacex-commercial-crew-space-station-launch-0" date="Wed Mar 02 2022 10:35:00 GMT-0800 (Pacific Standard Time)" id="ember196" class="card ubernode cards--card cards--2row cards--2col nodeid-477815 ember-view"></a>
  <div class="bg-card-canvas" style="background-image: url(/sites/default/files/styles/2x2_cardfeed/public/thumbnails/image/51846702013_a0cc55100a_k.jpeg);">
  <h2 class="headline"> ...</h2>
The way that sites like Wayback Machine handle this is by using the web-replay library Wombat https://github.com/webrecorder/wombat that also uses JS to insert those elements.

But what the hell! I was working on a similar HTML-downloading/reproducing tool and this bug really bothers me. I'd like either the HTML parsing standard to be updated to accept <div> inside of <a>, or for that to be made impossible to do via JS.


I think this issue could be circumvented by manipulating the page (replacing images, frames, css etc.) in the tab itself (SingleFile does it in background with a DOMParser instance). The trick is to avoid HTML parsing.


The list of problems that Manifest V3 causes are just more reasons to never use Chrome.


> For security reasons, you cannot save pages hosted on https://chrome.google.com, https://addons.mozilla.org and some other Mozilla domains.

Interesting. What is it about those pages that makes saving them raise security issues?


That is not an extension issue; it's a Google/Mozilla policy thing.


Maybe because JS files (specifically add-ons) run from the local filesystem are given escalated privileges compared to normal usage, perhaps for ease of development. I'm just speculating, though.


I think it’s a limitation on all extensions applied by Chrome/Firefox. My guess is to stop extensions from making you force install more extensions or something...

(Also what’s up Andrew! YC S09 represent :wave_emoji:)


Does this simply remove the JavaScript or do something more clever? Because I think in the age of SPAs, the proper way to save "content pages" might be to execute the JavaScript once and serialize the resulting DOM back to HTML. I didn't find anything in the FAQ that explains if it does something like that.


It saves what you see (and removes JS by default). There is an option to embed the JS and another one to save the "raw" page, but I would not say it is reliable. The cleverness lies more in the ability to produce light pages.


That's a nice and simple tool, good work. I'm personally using Zotero to save copies of web pages: https://www.zotero.org/. With the browser extension you can save a snapshot in a few seconds.


Zotero is actually using SingleFile under the hood to save web pages ;)


Oh, that’s nice :)


Similar project -> https://github.com/Y2Z/monolith

(I used both and ended up favoring monolith, but can’t remember why. I think they’re pretty comparable/am grateful for both of them)


I love monolith too, especially because it is so easy to modify. Only 2300 sloc!

Sure, it uses libraries to do the heavy lifting, but these are all popular, well-tested libraries with well-scoped feature sets (html5ever for parsing HTML, url for parsing urls, etc).

If you're looking for a tool like this but think you might need to tweak it, you should give monolith a try.


How do I toggle reader mode/readability? It doesn't seem to be able to save pages when I toggle Chromium's reader mode on.

I followed the other advice on this thread. In the options:

- Annotation editor > default mode > format the page

- Annotation editor > annotate the page before saving

It automatically formats the page into reader mode, then I can click the "Save the page" icon to save it. But sometimes I want to download the page as-is, like this thread for example. The "Restore all removed elements" button doesn't seem to revert the changes.

For now I just set the default mode to normal, enable "annotate the page before saving", and then click "Format the page for better readability" when needed.


You could also create 2 separate profiles in the options page. One profile would open the annotation editor and the other would not. Then, you would just have to save pages with the appropriate profile. To restore the page, you should click on the "format page" icon.


Thank you. I'll give multiple profiles a try. As mentioned in some of the comments, it's a good tool for managing bookmarks. ArchiveBox supposedly does something similar, but I couldn't make it work.


Thank you for the feedback! Regarding the bookmark management, it could certainly be better. I would have to find some time to code a bookmark manager extension based on SingleFile maybe.


Thank you for the extension. It works well enough for my bookmarks. Just open the bookmarks on my browser and use the "Save all tabs" feature.

I think I'm not the only one who wants an alternative to pocket. A bookmark manager that can archive the links to prevent linkrot.


The most impressive part of the demo is seeing how tidy his Downloads folder is!


Nice project! This project, and a similar project called Monolith [0], were a bit of an inspiration for making my own single-HTML-file tool called Humble [1], to solve a few edge cases I was having with bundling pages (and because I wanted a TypeScript API for making page bundles).

[0] https://github.com/Y2Z/monolith

[1] https://github.com/assemblylanguage/humble



I'm building a tool for people to have a personal archive of their digital life, so that 30 years from now they can revisit content they enjoyed in their younger years.

https://github.com/sergiotapia/ekeko

This is awesome! I would love to integrate this somehow into my project to "singlefile" bookmarks as people make them.

@gildas do you have any recommendation on how to approach this with your extension? Could I run a headless chrome and trigger this extension?


I confirm that you could use a headless browser for this. This is actually what SingleFile CLI does [1]. Here is an example of JS code showing how to configure and inject SingleFile with puppeteer [2].

[1] https://github.com/gildas-lormeau/SingleFile/tree/master/cli

[2] https://github.com/gildas-lormeau/SingleFile/blob/master/cli...
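For a quick smoke test, the CLI can also be run directly from a terminal, along these lines (see the CLI README for the exact options; the flag below is an example):

  single-file --browser-executable-path=/usr/bin/chromium-browser \
    https://example.com example.html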


Thank you!


Love this. Use it all the time. Handy for saving huge pages with all the styling intact for reading offline (like on a plane). You could save a webpage as a PDF, but I prefer this over a PDF.


Is this still on track to become a standard? https://github.com/WICG/webpackage


Even after reading the praise here I wasn’t prepared for how good and useful this extension is. It’s a perfect solution for saving local copies of web pages. I do this frequently, and am surprised I didn’t know about this until today. Even the way it handles settings for the extension is great, with good, built-in documentation. The ability to add annotations is icing, and since they become part of the HTML there is no lock-in or special file format needed.


This is good for people who don't have constant internet access and need to reference web resources offline.

Webpage saving technology does not seem to have kept pace with the evolution of the web.

Images loaded by CSS aren't saved at all. JavaScript on the page will often hijack a saved page and not let it display at all.

One option that works fairly well and does not require installing a browser extension is to save the page as a PDF.

I wish browser developers would put more effort in this area.


This is what 10 year old me thought "Save As" in IE would do, but soon realized the harsh reality of "that's not how any of this works".


Does it have the option to automatically save every page you navigate to? There were some extensions back in the 2000s ("slogger" I think was one, "shelve" or something similar was another) but I don't think they work any more. The pages I think to save now are never the ones I want to look at 5 years down the road.


It does, although I doubt you would want to do this, it's rather slow.


If you keep the javascript, you also get the world's most portable (desktop) application format...


Opening the repo makes you download a 17MB gif. I hope you're not on an expensive mobile connection.

p.s. the demo is nice


Why does this need to:

- Read and change all your data on all websites

- Modify data you copy and paste

- Manage your downloads

Is there a way to use a version that requires fewer of these permissions? E.g. it seems we can address the first permission by only activating it on click, but I'm not sure if that addresses the other ones.


I try to use optional permissions as much as I can. The first permission is required because of assets and frames stored on third-party servers. The second permission should be optional, I don't remember why it's not. I'll try to see if I can make it optional. The last permission is required in order to save the page on the filesystem with the "downloads" API. Note that even if I make these permissions optional, you might still have to trust me anyway ;)


Relevant 'awesome' list for web archiving: https://github.com/iipc/awesome-web-archiving

There are many similar tools there, from archiving to rendering.


In the olden days, Internet Explorer used to allow you to do this by saving the page to an MHT file. It would be a single archive with the HTML, images, etc. embedded.

New browsers don't seem to do this; they create a separate folder for the assets, which is super annoying.


Chromium-based Edge can produce .mht files as well.


Dang it, he beat me to it! I have been toying with the idea for quite some time, but this implementation is great, better than mine would have been, so I'm glad he did it.

Maybe I'll make a CLI implementation (sorta like wget but with this tacked on...)


This is great for a page. I'd love to see it expanded to include an entire site.


Chrome can save to a single file (.mhtml). I am not sure I understand the difference.


The difference is the output format. I created SingleFile before Chrome supported MHTML files. At that time, to save web pages in a single file, the only technical solution in Chrome was to implement something like SingleFile. The advantage of HTML is that this format is much more durable though.


That makes sense. I also used Data URIs to generate distributable web pages.


Yes, there is .mhtml, but its execution plainly sucks because it doesn't exactly save everything. It attempts to save, but it won't be valiant at it; it's like using mhtml without a "force (-f)" argument.


Great stuff!

For some reason, I went in expecting to see a JS-enabled multi-page web site turned into an SPA in a single HTML file, but I didn't expect to see images get embedded.

Perhaps offer a recursive traversal option too, but don't try that on Wikipedia :)


Back in the day this was always one thing that had me begrudgingly and shamefully opening IE so I could save a page as an MHT file. So long ago now. Cool to see this idea has been revived and not in a proprietary way


I love SingleFile and have been using it for years! Is there any version that works on current mobile browser versions? I've stuck with an old version of Firefox on Android that still supports the extension.


You should be able to use it on Firefox for Android Nightly (which is very stable) by following this procedure: https://blog.mozilla.org/addons/2020/09/29/expanded-extensio...



Thanks for this. I expected to see a pricing link somewhere, having been attuned to all the subscription SaaS these days. Glad to see there are still tools offering immense value for free.


It is in fact more or less self-financed by... hmmm... a SaaS that I market but it's in B2B.


I use SavePageWE; it can save the page (into a single file) as it was modified by JS after load, which is often useful.

The only thing I miss: I wish it were easier to script.


I have been using WebScrapBook (an add-on for Firefox) for some time. I really like it. Does anyone else have experience with this add-on? Good or bad?


I've been using it for a couple years (2 maybe) and I like it quite a bit as a quick and easy way to save pages. ArchiveBox looks fantastic, but I just don't have the motivation to set up the service and maintain it since I don't save enough links to make it worthwhile. SingleFile might be worth a shot, but it looks like WebScrapBook has been handling your needs just fine (they seem to have 90% of the same functionality).


Similar approaches were proposed at https://github.com/wabarc/wayback


Thanks!

ArchiveBox does indeed look fantastic. Their homepage alone is beautiful.

I bookmarked both ArchiveBox and now also SingleFile, but WebScrapBook gets the job done (in almost all cases).


As a WebScrapBook user, do you know if there is a migration path from Pocket or another hosted service?


Don't know about a migration option, but I do remember there's a lot of custom configuration possible.


I've been using this since Martin posted about it on Ghacks. Love using it and thank you gildas.


Thank you! I've been looking for this for a while, nice to see someone finally did it!


Microsoft had something called MHTML that did this about 20 years ago ... Tablet PC era.


Naming a thing takes creativity and luck. Congratulations on an excellent name!


Does it create an inline data URL for each image even if they're the same?


Most of the time, it will be able to deduplicate them. For example, if you save this page https://groups.google.com/a/chromium.org/g/chromium-extensio..., each distinct avatar will be embedded only once. To achieve this, SingleFile stores the content of duplicate images in CSS custom properties, displays them as background images in the IMG tags, and uses a (properly sized) transparent SVG image as the SRC. Thus, stylesheets are not broken.
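Roughly, the saved markup ends up looking like this (names simplified; the real output differs):

  <style>
    :root { --img-0: url("data:image/jpeg;base64,/9j/4AAQ..."); }
    .img-0 { background-image: var(--img-0); background-size: 100% 100%; }
  </style>
  <!-- every duplicate points at a tiny transparent SVG of the right size -->
  <img class="img-0" src="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' width='32' height='32'/>">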


Nice, that definitely saves some space


In fact, SingleFile implements several ways to save space. In practice, the most effective mechanism is shaking the CSS tree.

It's amazing how much CSS is useless in a page. It's especially annoying for SingleFile if it contains images... That's why SingleFile removes (almost) all unused rules, selectors and CSS properties by calculating the CSS cascade.
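A toy version of the idea, to make it concrete (the real implementation is far more thorough: it computes the cascade and handles frames, media queries, pseudo-classes, etc.):

  // Drop style rules whose selectors match nothing in the document.
  function dropUnusedRules(sheet) {
    for (let i = sheet.cssRules.length - 1; i >= 0; i--) {
      const rule = sheet.cssRules[i];
      if (rule.type !== CSSRule.STYLE_RULE) continue;
      try {
        if (!document.querySelector(rule.selectorText)) {
          sheet.deleteRule(i); // nothing matches: dead weight in the snapshot
        }
      } catch (e) {
        // unparseable (e.g. vendor-prefixed) selector: keep the rule
      }
    }
  }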


This would be very useful in many situations, and a great demo!


I'd also recommend "Print Edit WE" [1] and "Save Page WE" [2] for Chrome-type browsers, both by the same author. The first one allows for editing the page before printing/saving (as a single-page HTML or MHTML); the second one allows for single-page saves.

[1] https://chrome.google.com/webstore/detail/print-edit-we/olnb...

[2] https://chrome.google.com/webstore/detail/save-page-we/dhhpe...


If it’s a single file, then how do the images get stored?


Images are stored as data URIs [1]. Note that they could also be stored as entries in a zip file! [2]

[1] https://en.wikipedia.org/wiki/Data_URI_scheme

[2] https://github.com/gildas-lormeau/SingleFileZ
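Concretely, a saved IMG tag looks like this (payload truncated here):

  <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUg...">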


They're base64 encoded[0]. (This is an approach I myself have used in the past for simplifying the archival of regulatory texts.)

[0] https://github.com/gildas-lormeau/SingleFile/blob/15801c8ef4...


You read my mind, I was exactly looking for that!


Using it to export Logseq pages; works perfectly.


wget -r url ?


Ah, millennials invented .mht


Iran has a habit of using tools like this to trick defense contractors into using their page.



