Author here, it makes me really happy to see SingleFile on the front page of HN. Thank you! I'll take the opportunity to make you aware of the upcoming impact of Manifest V3 [1], and for those who prefer zip files, I recommend having a look here [2].
Thanks for this project. I found SingleFile a year or two ago and used it to take "HTML screenshots" of third-party sites that I could embed in guided walkthroughs, with real data swapped for example data, instead of just PNGs.
SingleFile was ultra-valuable for this.
If anyone has a similar use-case, I wrote some pretty rough (and slow) code to post-process SingleFile's output to remove any HTML that wasn't contributing to the presentational render by launching puppeteer and comparing pixels. It's available here: https://github.com/mieko/trailcap
One very useful thing you could add to this (if you feel like it) would be to make it work with snapshotted directories, rather than a single HTML file with inlined data. You can get the former with SingleFileZ and then extracting the resulting zip file.
I like these because they make it easier for me to make manual edits when necessary, and it's a better solution for long-term archiving (IMO). But I would love to add your project to my workflow.
Single File is one of my favorite addons since it allows me to keep offline copies of articles, tutorials, etc. I see online without losing images (a ton of articles have been lost over the years, and while some are preserved on archive.org, they often lack things like images, so I prefer to save anything I come across). So thank you for making it :-).
Now, having said that, the text in SingleFile-Lite's "Notable features of SingleFile Lite" reads like a list of issues :-P. It looks like these are issues with Chrome, but do you know if/how these "improvements" will affect Firefox?
AFAIK, for the moment Mozilla is aware of the regressions that Manifest V3 causes and is showing good will in trying to reduce them as much as possible. You can find some information about this here: https://github.com/w3c/webextensions/tree/main/_minutes
I've been using SingleFile for the last year or so, it's amazing!
I'm going to hijack your post for a question! I love the way you can use the editor and select "format for better readability," then save just the stripped down version of the page. I use this to send it to my e-ink device.
The question I have is whether it's possible to toggle the default save to use the formatted version automatically? I dug into the options and didn't turn anything up!
Thank you for your work! I've been using start.me for my new tab page (it's the page you see every single time you open a tab; I can't believe most people don't make it useful), but it's way too slow, so I SingleFile it and have a local Firefox extension to set it as my new tab page.
Thanks for the work you have done; it's a lazy man's heaven, especially for bulk downloads, and it has helped me a lot. About a month ago I decided to back up my bookmarks via ArchiveBox (more than 1k bookmarks), and the most reliable methods were SingleFile and wget.
> But product? Nothing is being sold and product sounds almost condescending. :)
Really!? You don't think it's condescending to go looking for criticism in someone's praise, haha? Heh, anyway :) You feel "product" is bad!? So weird!! I guess you find there what you bring to it. Wonder what you're protecting there; if you share more of your thinking, we can get to know you better. Even so, I think we can just celebrate gildas' achievement! :)
In fact, you simply do not need an extension to open pages saved with SingleFile (or SingleFile Lite) because they are standard HTML pages. So you don't have to worry about that.
The alternative format (used by the Internet Archive and Wayback Machine) is WARC. It's also a single file, but it preserves the HTTP headers as well, so it's aimed specifically at archival purposes. [1] The "wget" tool, which is co-maintained by the Web Archive people, also has support for it via CLI flags.
Though when it comes to mobile browser support, I'd recommend using MHTML, because WebKit and Chromium both support it upstream.
WARC is also used by the Webrecorder project. They made an app called Wabac which does entirely client-side WARC or HAR replays using service workers and it seems to have pretty good browser support, but I haven't really dug into the specifics.
I know that WebKit relies on either libsoup [1] (on Linux/Unices) or curl [2] (legacy Windows and maybe WPE(?)) as a network adapter, so the header handling and parsing mechanisms would have to be implemented in there.
Though, on macOS, WebKit tries to migrate most APIs to the Core Foundation framework, which makes it kind of impossible to implement as a non-Apple employee, because it's basically a dump-it-and-never-care open-source approach. [3]
Don't know about Chromium (my knowledge of their architecture is ~2012-ish and pre-Blink).
I wasn't sure about WPE in regards to libsoup due to the glib dependencies and all the InjectedBundle hacks that I thought they wanted to avoid.
I mean, in principle curl would run on the other platforms, too... but as far as I can tell there's an initiative to move as much as possible to the CF framework (strings, memory allocation, HTTPS and TLS, sockets, etc.) and away from the cross-platform implementations.
Over a decade ago I had a laptop but no internet at home. This was one of the ways I taught myself programming (and also downloaded dozens of manga): using Internet Explorer at a cafe, which had an option to save to MHTML, a single file with everything self-contained. Legit owe a portion of my success to this. I still have some of these files, old crusty "hello world" C++ tutorials, etc.
I have fantastic internet, and I still do something similar. Local docs just load so much faster, and if something happens (which it still does, even on Fiber in the US), I have docs and can program.
Lemme see if I can pull up the command I use to mirror doc sites.
For people who cannot afford internet access now, and for perhaps more in the future if times get more difficult, I believe this is a very important use-case.
I don't think it was ever native in Firefox, there is/was the excellent unMHT extension that was broken by Quantum/WebExtensions and The Great XUL Silliness. Shame.
I have Waterfox-Classic and unMHT (fished out of the Classic Addons Archive, just remember to turn off Waterfox's multiprocess feature) since I occasionally need to archive web pages - and more importantly, reopen them later.
MHTML is just MIME: literally every discrete URL as a MIME part, with its origin in a Content-Location header, all wrapped in a multipart container. I don't understand why it's not a default format.
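Stripped down, a saved .mht file looks roughly like this (the headers and boundary below are illustrative, not a byte-exact example):

    Content-Type: multipart/related; type="text/html"; boundary="----=_Part_0"

    ------=_Part_0
    Content-Type: text/html; charset=utf-8
    Content-Location: https://example.com/article.html

    <html>...</html>
    ------=_Part_0
    Content-Type: image/png
    Content-Transfer-Encoding: base64
    Content-Location: https://example.com/logo.png

    iVBORw0KGgo...
    ------=_Part_0--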
I can see WebExtensions breaking it (as it's a completely new set of APIs for extensions, and the losses do definitely still hurt)... but quantum/xul? How is that related, aside from "it happened around the same time"?
IANA firefox dev: XUL/XPCOM = old APIs, WebExtensions = new (multi-browser) API
Quantum was the project name for re-engineering Firefox internals, with lots of design changes, not just to extensions. XUL/XPCOM APIs were dropped; as an occasional programmer I understand why. "Quantum broke my plugins" is a reasonable first approximation for most users.
The problem is that it is a proprietary format. The advantage of the format produced by SingleFile (HTML) is that as long as your browser is capable of interpreting HTML, you will be able to read your archives without worries.
Not so proprietary. It's really just a plist file, a format that is known and has even been open-sourced by Apple [1]. Really, it's only proprietary in the sense that no other platforms have implemented it.
> MHTML, (...) is a web page archive format used to combine, in a single computer file, the HTML code and its companion resources (such as images, Flash animations, Java applets, (...)
Does anyone else get two security warnings whenever you try to save an MHTML page using a Chrome extension? I have to click one warning's button to confirm that I indeed want to save the "dangerous" file and another to confirm I'm really sure. It's gotten very annoying. I've looked all over for an option to disable this behavior but haven't been able to find one.
I've extensively looked into this, as I can't find a good, light, and easy backup option that isn't extreme overkill.
I thought MHTML was NOT standardized, which is why it wasn't supported across all browsers yet. From what I remember, every company was doing their own implementation of it. Maybe it's gotten more standardized over the last few years, though.
The big one in my experience is that it doesn't play well at all with JavaScript. SingleFile, to my knowledge (I experimented with it briefly), allows all JS to load on the page and can then embed the loaded media as base64. I think it also has heuristics to embed relevant JS as well. It still only gets you 90% of the way there, and I came to the conclusion that unless you are doing web-archive-type work or need audio/video, a composite image works well.
Can we please stop with the 17MB GIF images used as demos? They use up lots of data immediately as you open the page, and they're impractical: you don't know how long the animation is, you can't fast-forward or rewind, and you can't go fullscreen on mobile.
And GitHub supports embedded videos in README.md files; videos are generally smaller than GIF files, and their disabled autoplay is a feature: you save your data until you press play.
> GitHub supports embedded videos in README.md files
True since May 2021 so I think a lot of people are still finding this out...
In my experience GIF is still the most set-it-and-forget-it way to know a video will play; to get cross-platform support out of MP4 you may have to provide two different codecs. Anyway, not disagreeing with you, and most GIFs could drop 90% of their size with a better choice of resolution and framerate. This README is particularly egregious, doing a screen capture with scrolling.
As for saving bandwidth until you want to play, I haven't tried this yet but it seems adequately clever to wrap a loading=lazy gif inside a details/summary tag: https://css-tricks.com/pause-gif-details-summary/
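The gist of the trick, as I understand it, is just this (illustrative snippet, not tested):

    <details>
      <summary>Play demo (17 MB GIF)</summary>
      <!-- loading="lazy" is meant to defer the GIF download
           until the details element is opened and the image is about to be shown -->
      <img src="demo.gif" loading="lazy" alt="SingleFile demo">
    </details>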
See this gist and the code comments [0] - basically you just need to know the magic flags to pass to ffmpeg, transcoding the file with all the right settings.
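For the curious, the usual incantation is something in this spirit (not necessarily the exact gist contents; tweak to taste):

    # convert a screen-capture GIF into a widely playable MP4;
    # yuv420p and even dimensions are what most browsers/players expect
    ffmpeg -i demo.gif -movflags +faststart -pix_fmt yuv420p \
           -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" demo.mp4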
Not to mention that H.264 can take quite a bit of horsepower to decode and play as well (assuming your machine doesn't have a hardware chip specifically for doing just that).
My 2014 ThinkPad X1 Carbon (gen 3) doesn't have hardware transcoding as far as I can tell, which made Zoom and Discord impossible to use for class, especially because there was no way (that I knew of) to disable all video except the presenter's. Even playing a YouTube video on it makes it ramp up.
I'm not sure which CPU you have specifically but the lowest-end model of the X1 Carbon Gen3 has an i5-5200U [1] that lists Intel Quick Sync Video support.
From the wiki page for Quick Sync [2]:
> Intel Quick Sync Video is Intel's brand for its dedicated video encoding and decoding hardware core. Quick Sync was introduced with the Sandy Bridge CPU microarchitecture on 9 January 2011 and has been found on the die of Intel CPUs ever since.
I can't confirm but I'd guess your performance issues lie elsewhere than in the h264 decoding specifically.
If you check out the generation-codec table in that Wikipedia article [1], under Broadwell (I believe that's the 5200U's generation name), it says there is support for AVC (which I believe is H.264; I'm not a codec whiz), so that's a really good point. I'm not sure why I've consistently had issues with this on my machine, then. I wonder if it's something with the configuration on Linux?
Thanks for pointing that out. I've looked at this table before and paid attention to HEVC, not AVC, so I believe that's where my mistake came from.
Accelerated video decode is often disabled by default on Linux versions of browsers and can be quite dependent on versions of drivers/mesa/X-vs-Wayland/etc.
YouTube by default prefers newer, bitrate-saving codecs over old ones if it thinks your CPU can handle software-decoding them. On my 2017 Dell XPS, 1080p and lower resolutions on YouTube play in software-decoded AV1; only 1440p and higher play in hardware-decoded VP9. So playing 4K video on YouTube is less taxing for my CPU than playing a 1080p video...
The problem is Zoom and Discord are doing multiple streams. But it really shouldn't be a problem.
H.264, even the High profile, is not CPU-intensive on a 2014 machine, unless you are watching 1080p at 5-10 Mbps, which is not the norm for internet video.
Author here, sorry for the GIF file. I created it because people were not happy with the video hosted on YouTube. AFAIK, embedded video files did not work on GitHub when I made this demo. I'll try to improve this in the future.
I wanted to comment on how useful that demo was to me. It did a great job at demonstrating why this is useful and how well it works compared to the native browser implementation. Thank you both for the demo and for the project!
GitHub only recently expanded video support from gif to decent video formats, and many github enterprise installs don't have those new features yet. So, keep spreading the word.
If the demo sequence is <5 seconds, I have never found myself becoming impatient. Gif is perfect for very brief demos. Anything longer than that and I'd like to have some idea where I am at in the video stream (and other controls as indicated)
I wish browsers came standard, preconfigured with warning dialogs that triggered if assets attempting to load were beyond some threshold. That threshold could be decided by the browser vendors group based on some collection of network statistics and be adjusted on an annual basis or so.
> And GitHub supports embedded videos in README.md files
Any documentation on this? Because I have tried to embed video in issues and PRs before, and did not manage. I'm hoping such documentation will explain how this extends to issues and PRs.
Giving a massive upvote for this; disappointed and confused to see you've been downvoted here. There's literally no reason to use GIFs like this, and, as you stated, it's massively disrespectful to those who aren't fortunate enough to have broadband connections but would like access to the information.
Using data so wastefully like this always reeks of privilege to me - especially on something like GitHub. Wikipedia, for instance, never allows things like this.
I would be surprised if the author wasn't using WebM to get a smaller filesize (not to mention higher quality), but the project itself leads me to believe that the author has a lot of free disk space to use.
There's no need to make further assumptions about the author (who, btw, took the time to build a very useful tool and share it on the Internet for free). Just point out the issue with the GIF and move along.
I think you missed something here; the parent wasn't insulting the author. They were implying that someone who made an archiving tool probably has resources that help with archiving (bandwidth means access to documents and storage space means a place to save them; you get the idea).
I love, love this extension. I am working on an app to turn this into a single-click bookmark system on Linux: run an inotify service to watch your downloads, then process any SingleFile downloads into a database and update a browsable index.
I think I basically get the idea, what kind of database are you using? Recoll sounds like a good idea, but I'm also thinking about how I might also make this public-ish.
(i.e. I teach in college and would love to have a centralized way to store and search all my assigned readings, which are most often webpages)
Each HTML page is processed by (1) getting the URL, title, and time saved (this is underrated, as the approximate time of saving is useful if you want to rediscover a page later), then (2) taking a screenshot, and finally (3) extracting text with readability.js and hopefully doing some keyword analysis.
Right now it is stored in a local SQLite database, although the article content is stored in text files. For search, I can use ripgrep to look through the associated text files.
The eventual goal is to create a Flask app that will allow for interactive management of the bookmarks (tagging, searching). I've already got static generation of bookmarks.
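If anyone wants to play with the idea, the watching part is only a few lines of shell (rough sketch, assuming inotify-tools and the sqlite3 CLI are installed; the real app does more than this):

    # create the index table once, then watch ~/Downloads for finished HTML files
    sqlite3 bookmarks.db 'CREATE TABLE IF NOT EXISTS pages(path TEXT, title TEXT, saved_at TEXT);'

    inotifywait -m -e close_write --format '%w%f' ~/Downloads |
    while read -r file; do
      case "$file" in *.html)
        # pull the <title> out of the saved page (naive, single-line titles only)
        title=$(sed -n 's:.*<title>\(.*\)</title>.*:\1:p' "$file" | head -n1)
        # naive quoting; fine for a sketch, not for titles containing quotes
        sqlite3 bookmarks.db \
          "INSERT INTO pages(path, title, saved_at) VALUES ('$file', '$title', datetime('now'));"
        # screenshot + readability.js text extraction would happen here
      ;; esac
    done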
I archived (privately) some documentation pages from some of our vendors that were behind a login page using this, just in case they became inaccessible at a critical time for us.
Maybe a little OT, but founders should take a careful look at this landing page. That's how you sell something. The demo is clear about the problem they're trying to solve and it convinced me that their product actually solves it. It's not just all the information they've included, but also the lack of irrelevant clutter.
I scrolled past the GIF because I didn't realize it was an informative GIF. The first few seconds looked like just an animated logo, so I never stayed to watch it. It could've just started with an action instead of animating the logo.
Related: I used to keep a collection of locally mirrored web pages a long time ago, with a legendary Firefox extension called ScrapBook [0] (now long retired). The surprise for me is that after all these years I still remembered the name...
While writing this comment I found that it lived on as a (now "legacy") new extension named ScrapBook X [1], and then yet another one named WebScrapBook [2], which seems to still be alive!
What a cool project! I love the way this embeds images. One of the things I miss most, though, when going back to old sites, is embedded audio or video. From looking at the options, it seems like it might be able to handle encoding video and/or audio as data URIs, but it's not totally clear if SingleFile does this or not. I wasn't sure if I was doing the correct things to force this behavior in the options. It would be great if the README could clarify how these are handled by SingleFile. Sometimes it might be nice to be able to embed these sorts of things, even if it does make the HTML ridiculous and bloated. Or, barring that, maybe just a recommendation to use one of the other formats in the comparison table for this kind of use case.
Unfortunately that won't allow you to click links in your offline version. You can do this properly with wget:
(sorry I don't know how to do code formatting in hackernews)
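Something along these lines (flags from memory, adjust to taste):

    # mirror the site for offline browsing, rewriting links to point at the local copy
    wget --mirror --convert-links --page-requisites --adjust-extension \
         --wait=2 https://example.com/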
Are you suggesting to mirror e.g. the entire Wikipedia through wget?
That is not only suboptimal, it also puts stress on the server. At least you added a --wait=2, but on any large site/hoster/CDN this might still get your IP banned or throttled. And on e.g. the English Wikipedia this will then take 149 days, which means that by the time you hit the last page, the first ones (and their links) are out of date.
If you add '--no-parent' (doesn't request anything above the requested URI that isn't a page dependency) and '--level=5' (only follows links 5 levels deep), you won't get all of a site. That makes it more realistic for grabbing Wikipedia articles.
Do you mean converting links so that they point to your local copy of other pages from the same site? Yeah it doesn't do that; it isn't a tool meant for archiving whole websites. It's just for archiving a single page, with all the content baked into it. Like a bookmark but doesn't linkrot.
Zotero deals with this reasonably well—and happens to be using SingleFile under the hood. Its landing page just targets a specific audience (academics), which means probably upwards of 90% of the people who would happily use it probably end up bouncing after thinking, "This isn't for me", before ever trying it. Give it a shot.
Ahh damn, should fix it!
For now, you can edit the URL manually to take a peek.
If you're interested, feel free to send a DM on Twitter @recursiveSwings, I'll let you know once it's in Beta! :)
This is great. I've always wondered why this isn't the default behaviour for page saving in browsers. To an ordinary user saving a page implies saving a single file, not a file plus a directory of stuff. HAR can be useful but seems only for niche or specialised reasons.
Add SingleFile? This extension will have permission to:
- Access your data for all websites
- Input data to the clipboard
- Extend developer tools to access your data in open tabs
- Download files and read and modify the browser's download history
- Access browser tabs
... all of which I don't mind, as long as the extension can't exfiltrate any of the data it can access (send it to a third party), i.e.:
- no network connections from the extension
- no modifying web pages or executing any code in the context of a web page
Does anyone know whether these sorts of extensions can exfiltrate data? -- this is a concern if the project author's credentials are stolen by a threat actor, which has happened before.
Security question: Is a web extension safe if it is installed but if you're not using it at the moment? For example, if I were logged into my bank's website and I did not click the SingleFile button in the extension toolbar, could it still theoretically collect info from my bank's webpage or do other actions?
I'd like to use SingleFile and have no reason at all to distrust it, but I'd like to understand the security impact of installing lots of web extensions. How do people handle security risks like that? Do you run a separate vanilla browser with no extensions for sensitive tasks?
For technical reasons beyond my control, SingleFile injects a (very small) script when the page loads even if you don't click on the button. It could also send any data to a third party server. Unfortunately, it is therefore impossible for me to technically and formally guarantee that SingleFile cannot behave maliciously.
Note however that the extension has the status "recommended" on Firefox and that it undergoes a manual code review by Mozilla at each update.
On Chrome you can go into extension settings and adjust permissions so SingleFile only has permissions “on click”. Then it won’t/can't inject that little JS snippet into a page until you actually want to use the extension. The only downside is that after enabling you then have to refresh the page for the extension to do its work.
I wish this behavior was better known and encouraged by Google.
If you care about security, consider using Qubes OS with hardware-virtualized VMs for compartmentalization. Then, your Firefox for banking won't have the same extensions you use elsewhere. Works for me.
How old is that demo gif? I just tried reproducing the normal saving shortcomings, and the bottom image ("Example of an SVG image with embedded JPEG images") loads just fine from the local folder, so this seems outdated.
That being said, it's a bit weird that this kind of tool is even necessary at all. I would have expected native saving to include CSS background graphics as well, but apparently they don't for some reason, so I think this is pretty useful. Until now, I have also used pandoc (--standalone) to merge all resources into a single HTML file which worked great.
We really, really need Web Bundles to progress and fix these problems correctly, once and for all. There are a lot of things that a tool like this can never get right, and the rest is complicated work that should never need to be done if we have a standard multi-file bundle format.
I was hoping this tool also solved a problem that comes from saving & reproducing JS-framework-heavy websites.
Here's the bug: according to the HTML spec, elements like <h2> and <div> cannot be inside <a> tags. But using JS you _can_ push <div>s inside <a>s. (It happens with document.insert-type functions; frameworks like Angular/React allow this.)
The way that sites like Wayback Machine handle this is by using the web-replay library Wombat https://github.com/webrecorder/wombat that also uses JS to insert those elements.
But what the hell! I was working on a similar HTML-downloading/reproducing tool and this bug really bothers me. I'd like either the HTML parsing standard to be updated to accept <div> inside of <a>, or for that to be made impossible to do via JS.
I think this issue could be circumvented by manipulating the page (replacing images, frames, CSS, etc.) in the tab itself (SingleFile does it in the background with a DOMParser instance). The trick is to avoid HTML parsing.
Maybe because JS files (specifically add-ons) run from the local filesystem are given escalated privileges compared to normal usage, perhaps for ease of development. I'm just speculating, though.
I think it’s a limitation on all extensions applied by Chrome/Firefox. My guess is to stop extensions from making you force install more extensions or something...
(Also what’s up Andrew! YC S09 represent :wave_emoji:)
Does this simply remove the JavaScript or do something more clever? Because I think in the age of SPAs, the proper way to save "content pages" might be to execute the JavaScript once and serialize the resulting DOM back to HTML. I didn't find anything in the FAQ that explains if it does something like that.
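By "serialize" I mean something as crude as this, run after the page has settled (a rough illustration, not necessarily how SingleFile does it):

    // let the scripts run, then dump the live DOM back out as HTML
    const html = '<!DOCTYPE html>\n' + document.documentElement.outerHTML;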
It saves what you see (and removes JS by default). There is an option to embed the JS and another one to save the "raw" page, but I would not say it is reliable.
The cleverness lies more in the ability to produce light pages.
That's a nice and simple tool, good work. I'm personally using Zotero to save copies of web pages: https://www.zotero.org/. With the browser extension you can save a snapshot in a few seconds.
I love monolith too, especially because it is so easy to modify. Only 2300 sloc!
Sure, it uses libraries to do the heavy lifting, but these are all popular, well-tested libraries with well-scoped feature sets (html5ever for parsing HTML, url for parsing urls, etc).
If you're looking for a tool like this but think you might need to tweak it, you should give monolith a try.
How to toggle reader mode/readability?
It doesn't seem to be able to save pages when I toggle Chromium's reader mode on.
I followed the other advice on this thread. In the options:
- Annotation editor > default mode > format the page
- Annotation editor > annotate the page before saving
It automatically formats the page into reader mode, then I can click the "Save the page" icon to save it.
But sometimes I want to download the page as is, like this thread for example. The "Restore all removed elements" button doesn't seem to work to revert the changes.
For now I just set the default mode to normal, enable "annotate the page before saving", and then click "Format the page for better readability" when needed.
You could also create 2 separate profiles in the options page. One profile would open the annotation editor and the other would not. Then, you would just have to save pages with the appropriate profile. To restore the page, you should click on the "format page" icon.
Thank you. I'll give multiple profiles a try. As mentioned in some of the comments, it's a good tool for managing bookmarks. ArchiveBox supposedly does something similar, but I couldn't make it work.
Thank you for the feedback! Regarding the bookmark management, it could certainly be better. I would have to find some time to code a bookmark manager extension based on SingleFile maybe.
Nice project! This project, and a similar project called Monolith[0], were a bit of an inspiration for making my own single-HTML-file tool called Humble[1], to solve a few edge cases I was having with bundling pages (and since I wanted a TypeScript API for making page bundles).
I'm building a tool for people to have a personal archive of their digital life, so that 30 years from now they can revisit content they enjoyed in their younger years.
I confirm that you could use a headless browser for this. This is actually what SingleFile CLI does [1]. Here is an example of JS code showing how to configure and inject SingleFile with puppeteer [2].
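In very rough terms, the approach looks like this (a minimal sketch; "single-file.js" and "processPage()" are placeholders, not SingleFile's real entry points, so see the linked example for the actual API):

    // headless browser: load the page, inject a serialization script, save the result
    const puppeteer = require('puppeteer');
    const fs = require('fs');

    (async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto('https://example.com', { waitUntil: 'networkidle0' });
      await page.addScriptTag({ path: 'single-file.js' });    // inject the bundled script (placeholder path)
      const html = await page.evaluate(() => processPage());  // placeholder global, returns the page as a string
      fs.writeFileSync('page.html', html);
      await browser.close();
    })();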
Love this. Use it all the time. Handy for saving huge pages with all the styling intact for reading offline (like on a plane). You could save a webpage as a PDF, but I prefer this over a PDF.
Even after reading the praise here I wasn’t prepared for how good and useful this extension is. It’s a perfect solution for saving local copies of web pages. I do this frequently, and am surprised I didn’t know about this until today. Even the way it handles settings for the extension is great, with good, built-in documentation. The ability to add annotations is icing, and since they become part of the HTML there is no lock-in or special file format needed.
Does it have the option to automatically save every page you navigate to? There were some extensions back in the 2000s ("slogger" I think was one, "shelve" or something similar was another) but I don't think they work any more. The pages I think to save now are never the ones I want to look at 5 years down the road.
Is there a way to use a version that requires less of these permissions? e.g. it seems we can address the first permission by only activating it on click, but I'm not sure if that addresses the other ones.
I try to use optional permissions as much as I can. The first permission is required because of assets and frames stored on third-party servers. The second permission should be optional, I don't remember why it's not. I'll try to see if I can make it optional. The last permission is required in order to save the page on the filesystem with the "downloads" API. Note that even if I make these permissions optional, you might still have to trust me anyway ;)
In the olden days, Internet Explorer used to allow you to do this by saving the page to an MHT file. It would be a single archive with the HTML, images, etc. embedded.
New browsers don't seem to do this; they create a separate folder for the assets, which is super annoying.
Dang it, he beat me to it! I have been toying with the idea for quite some time, but this implementation is great, better than mine would have been, so I'm glad he did it.
Maybe I'll make a CLI implementation (sorta like wget but with this tacked on...)
The difference is the output format. I created SingleFile before Chrome supported MHTML files. At that time, to save web pages in a single file, the only technical solution in Chrome was to implement something like SingleFile. The advantage of HTML is that this format is much more durable though.
Yes, there is .mhtml, but its execution plainly sucks because it doesn't exactly save everything. It attempts to save, but it isn't valiant about it; it's like using MHTML without a "force (-f)" argument.
For some reason, I went in expecting to see a JS-enabled multi-page website turned into an SPA in a single HTML file, but I didn't expect to see images get embedded.
Perhaps offer a recursive traversal option too, but don't try that on Wikipedia :)
Back in the day this was always one thing that had me begrudgingly and shamefully opening IE so I could save a page as an MHT file. So long ago now. Cool to see this idea has been revived and not in a proprietary way
I love SingleFile and have been using it for years! Is there any version that works on current mobile browser versions? I've stuck with an old version of Firefox on Android that still supports the extension.
Thanks for this. I expected to see a pricing link somewhere, having been attuned to all the subscription SaaS these days. Glad to see there are still tools offering immense value for free.
I've been using it for a couple years (2 maybe) and I like it quite a bit as a quick and easy way to save pages. ArchiveBox looks fantastic, but I just don't have the motivation to set up the service and maintain it since I don't save enough links to make it worthwhile. SingleFile might be worth a shot, but it looks like WebScrapBook has been handling your needs just fine (they seem to have 90% of the same functionality).
Most of the time, it will be able to deduplicate them. For example, if you save this page https://groups.google.com/a/chromium.org/g/chromium-extensio..., each different avatar will be embedded only once. To achieve this, SingleFile stores the content of duplicate images in CSS custom properties, displays them as background images in the IMG tags, and uses a (properly sized) transparent SVG image as the SRC. Thus, stylesheets are not broken.
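In the generated page it ends up looking roughly like this (names, sizes, and attribute details are illustrative, not the exact output):

    <style>
      /* the actual image bytes live once, in a CSS custom property */
      :root { --sf-img-1: url("data:image/png;base64,iVBORw0KG..."); }
      img.avatar { background-image: var(--sf-img-1); background-size: 100% 100%; }
    </style>
    <!-- the SRC is a tiny transparent SVG with the right dimensions, so the layout is preserved -->
    <img class="avatar"
         src="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' width='48' height='48'/>">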
In fact, SingleFile implements several ways to save space. In practice, the most effective mechanism is shaking the CSS tree.
It's amazing how much CSS is useless in a page. It's especially annoying for SingleFile if it contains images... That's why SingleFile removes (almost) all unused rules, selectors and CSS properties by calculating the CSS cascade.
I'd also recommend "Print Edit WE" and "Save Page WE" [2] for Chrome type browsers, both by one author. First one allows for editing of the page before printing/saving (as a single page HTML or MHTML), second one allows for single-page save.
[1] https://github.com/gildas-lormeau/SingleFile-Lite
[2] https://github.com/gildas-lormeau/SingleFileZ