I'm really curious about how the creators of the Wayback Machine are working to save modern (perhaps sometimes somewhat unnecessarily overcomplicated) web pages that use SPA "techniques". Have they implemented a Googlebot-like crawler that interprets JavaScript and spits out some predigested final DOM tree? Or do they record all page-initiated network traffic and just replay it, sort of? Lots of interesting research opportunities here, btw.
This is where archival meets browser/web tech, in a kinda complicated way. I would hope that people from both of these backgrounds have been working on this stuff together. If not, please start soon.
A crawler has two high level options: parse the page, or render the page.
Most of our parser-based crawling is done by Heritrix (crawler.archive.org) and most of our render-based crawling is done by a proxy-based recorder similar to what you theorize (https://github.com/internetarchive/brozzler).
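A minimal way to picture the parse-vs-render split, purely as an illustration (this is not how Heritrix or brozzler are actually built, and the binary name and URL below are placeholders):

    # Parse: fetch the raw HTML and pull out links without executing any JavaScript.
    curl -s https://example.com/ | grep -oE 'href="[^"]+"'

    # Render: let a headless browser execute the page's JavaScript and dump the
    # final DOM, which is what an SPA-heavy site actually shows its users.
    # (Binary may be chromium, chromium-browser, or google-chrome depending on the system.)
    chromium --headless --disable-gpu --dump-dom https://example.com/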
I think this is a challenge that the archive.org/Wayback Machine people need to tackle to keep general web page indexing working going forward.
To me it's kind of obvious that the "custom archivers" approach isn't going to scale to the entire web. "Custom archivers" implies some set of rules per "site engine". The problem is that there are probably (guessing here) 1M+ unique "site engines", or whatever you call them. I.e. the web is way too diverse.
There probably needs to be something generic that works with complicated pages 95% of the time, without prior knowledge of the API structure/dynamics.
If "custom archivers" is the plan for tackling the archival of the web for the past five and the next fifteen years... I really think we need to do better.
Disclaimer: I'm not directly involved in any of this, though I've followed some general discussions.
1. The Web is much less diverse than you think it is, for two reasons.
2. Zipf/power-law distributions: the largest sites are a huge part of the Web. If people are spending (say) an hour online per day, and 40 minutes of that is FB/Insta, then all the rest of the Web gets 20 minutes. And much of that will go to the next most popular sites. Attention is fundamentally limited. In terms of experience, the Web is fairly small.
3. Which of course isn't content, but here again, much of the original content only hits a relatively small number of sites. I have looked into news site stats, and what you'll see for national papers of record (NYT, WSJ, WashPo, etc.) is about 150-500 items per day (Sundays are highest). The news services (AP, UPI, AFP, Reuters) tend to run 1,000 - 5,000 items/day. User-generated content sites are obviously higher than this, though much of that content is derivative (sourced/linked elsewhere).
4. Web engines. Even with a diversity of sites, there are a limited number of systems driving them. I've been noticing news sites consolidating around a small number of standard templates, and of course FB, YT, Instagram, Reddit, etc., will get you a bunch of specific Web engines. Which means it's less "one site per archiver" and more "what sites does this archiver work on?". And worse: over what period of time -- just because an archiver works on www.example.com from 2018-1-1 to 2018-12-31 doesn't mean it works from 2019-1-1 onward.
There are a few other approaches. Archive.today (archive.is / archive.fo) basically translates pages to its own representation. There are also standardised presentation engines based on Readability (now defunct as a site, though its parser remains), used by Mozilla's Pocket and other tools (Instapaper, etc.).
5. I also suspect we may see at least some normalisation around HTML5 (or subsequent) elements. Which I'd really like to have happen. NY Times, for example, has abandoned <table> for its own custom, and horribly broken, table-presentation alternative crap.
6. Yes, this leaves one-off SPAs which may be hard to archive. I don't know what the plans are for that.
(Now to see what IA have had to say about any of this.)
Update: from an IA response in this thread, the render-based crawler is "brozzler" (https://github.com/internetarchive/brozzler).
Firefox users who use the Wayback Machine frequently may want to add the Wayback Machine add-on to their toolbar. Along with 'first', 'recent' and 'overview' selections it includes 'Save Page Now', as well as related Alexa, Whois and Twitter connections.
I see that it's licensed under the GPLv3, but where's the source?
EDIT: Maybe it's this one[1], but it's under a different license, AGPLv3. The repo also hasn't been updated since 2016, but the extension page says last update was in 2018. Are the changes and re-licensed source elsewhere?
Does the Wayback Machine have a long-term plan that anyone is familiar with? Is their goal to preserve the web indefinitely? Is the hope that storage and compression improvements over time will keep up with content creation?
And just to be clear, I think the Wayback Machine is great, and the fact that I can look up my personal, basically zero-traffic website from 15 years ago and see it is truly astonishing to me -- I'm just curious what this looks like in 10, 20, 50 years.
The best way to get this question answered would be to email or tweet at Brewster Kahle, who started and heads the Internet Archive.
If you're in San Francisco, the Internet Archive is also hosting a block party the evening of Oct 23rd from 5pm-10pm, and staff will be there to answer questions (tickets are $15).
If anyone's interested in attending our party next Wednesday but the cost is presenting a difficulty, shoot me an email (in my profile) and I'll send you a ticket at no cost.
Information is not infinitely compressible. In fact, for lossless compression, it’s proven that we can’t do much better than we already do. At some point you just need more storage. However we are several orders of magnitude away from the physical information density limit (surface of a black hole).
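For the lossless case, the limit follows from simple counting, independent of any particular algorithm: there are 2^n distinct inputs of n bits but only 2^n - 1 possible outputs shorter than n bits, so no lossless scheme can shrink every input; anything that makes some inputs smaller necessarily makes others larger.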
For complex reasons, human language (spoken and written) is about 50% redundant, across a wide range of independent languages.
Tabular data can be vastly more compressible, and I'd routinely see 90% or better compression across a range of datasets (mostly business, financial, and healthcare data). Data of highly random events might be somewhat less so.
Image, audio, and video data, when stored in modern codecs, is already highly compressed. When you're working with raw (WAV, TIFF, BMP, RAW) datatypes there's a huge opportunity for compression, but mp3, ogg, mp4, png, gif, jpg, etc., are already pretty highly compressed. There's a distinction between lossy (jpg, mp3) and lossless (png, ALAC) formats. You get smaller files with lossy formats, but you're actually losing some of the original data, whilst lossless codecs allow full reconstruction of the original source image, audio, or video.
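A quick way to see both ends of this from a shell (a rough sketch; exact sizes, and the dictionary path, will vary by system):

    # Maximally redundant input: a megabyte of zeros gzips down to ~1 KB.
    head -c 1000000 /dev/zero | gzip -c | wc -c

    # Random input: a megabyte from /dev/urandom comes out slightly larger than it went in.
    head -c 1000000 /dev/urandom | gzip -c | wc -c

    # Natural-language-ish input: a word list typically shrinks to roughly a
    # quarter to a third of its original size.
    gzip -c /usr/share/dict/words | wc -c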
Your comment about simulated universes gets to a key philosophical question about information, truth, and models. Generally, any representation we have of the universe is at best an abstraction of it, and hence a small, lossy, model.
I think statements about theoretical limits on compression are ignoring emergent properties. We know you can "compress" certain things infinitely; for example, the Mandelbrot set.
Compressing arbitrary inputs using emergent properties may never be practical, but it seems reasonable that you could trade computation for compression to an arbitrary extent (searching an emergent series for chunks of data that match your input).
No information we capture has infinite precision anyway, so fractal based compression falls under lossy compression. It is a much more complicated task to identify fundamental limits on lossy compression performance and an even harder task to have a collective agreement on "good enough" for a given purpose.
I was actually thinking more about natural languages than fractals. Maybe human thought is so utterly derivative you can just generate a random stream of words and it will contain most of the text that humanity will ever produce.
Then it can be compressed down to an index into the libraryofbabel
Save Page Now seems huge. It's a real bummer going to old forums for an obscure hobby or fandom only to find all the text and none of the images, music, etc.
I used to run an old forum that's now just an archive and I've been really meaning to download all the images linked in the posts in case they go offline. At some point during the forum's run I added a file upload feature which seems to have helped a lot (by avoiding external dependencies), but did not solve the problem. Fortunately I believe I have many of the missing images saved, but there very likely are important things missing.
I'm planning to launch a new forum next year and I think I'm going to write a script to periodically archive all images and links posted to the forum. I might disallow external images entirely, though that seems rather extreme and might just push people to post a link rather than use the upload feature.
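For what it's worth, a first cut at that kind of script can be pretty small. A rough sketch, assuming the externally linked URLs have already been dumped out of the forum database into urls.txt (one per line), and assuming Save Page Now still accepts a plain GET to https://web.archive.org/save/<url>:

    # Keep a local mirror, preserving the host/path layout so links can be
    # rewritten later if the originals disappear.
    wget --no-clobber --force-directories --input-file=urls.txt --directory-prefix=./mirror

    # Also ask the Wayback Machine to capture each URL (Save Page Now), a few at a time.
    xargs -P 4 -I{} curl -s -o /dev/null 'https://web.archive.org/save/{}' < urls.txt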
There's an incredible amount of information stored in obscure web forums, often in posts with photos. The damage that services like Photobucket have done by deleting old files, or by restricting hotlinking, has been incalculable. I worry that Imgur has the potential to do even worse damage, as so many forum users have converged on their service after others became unavailable.
(Imgur's popularity with Reddit users leaves Reddit highly vulnerable as well.)
Crazy idea: a browser extension that downloads images as the user comes across them in their browser and uploads them to something distributed - perhaps built on top of IPFS? Users could choose which domains it would be active on. The network could be split up by either domain or topic (say, people interested in diagrams of space, which might span several domains/sites).
The problems with putting that kind of data in any sort of distributed service are that:
1) It depends upon enough users being able to consistently contribute a lot of storage to the system. It turns out that this is hard. Casual users are actually a hindrance, because they'll suck up a bunch of bandwidth trying to replicate data, then drop out of the swarm forever.
2) The service will inevitably be used to host illegal pornographic content. Without some sort of centralized control, there's no way to stop this, making participation legally problematic.
Run grab-site [1] periodically (with the --no-offsite-links flag) and upload the resulting WARC files into an item in the Internet Archive. They can then be ingested by Wayback. If you prefer, I can do this for you as part of my existing archival operations.
This is trivially scripted, or there are a few existing generators.
I've created archives of ~12,000 or so posts, from an old desktop Linux system over modest residential broadband, in less than an hour, running up to 20 parallel requests via xargs or GNU parallel.
For a basic curl-based URL archiver (call once per URL on your list):
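Something along these lines should do it, assuming Save Page Now still accepts a plain GET to https://web.archive.org/save/<url> (a sketch only, adapt to taste):

    #!/bin/sh
    # save-url.sh -- submit a single URL to the Wayback Machine's Save Page Now.
    url="$1"
    curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "https://web.archive.org/save/${url}"

Run it over a list with something like "xargs -n 1 -P 20 ./save-url.sh < urls.txt" to get the parallelism mentioned above.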
Obviously they paid more for storage in the past than they would today, and it's a different kind of solution, but if you were to buy 50 petabytes today using Backblaze's Storage Pod 6.0 design it would run about $1,750,000 (roughly $0.035/GB).
And that's before ongoing maintenance costs and drive failures.
I feel the need to donate to Internet Archive soon, as I have greatly benefited from it in the past and am sure to in the future too!
As much as I appreciate the Wayback Machine, it's the responsibility of authors to choose an authoring format that can stand the test of time, at least for content you care about. HTML is built on a rich foundation of markup languages and is more than adequate for preservation. The mere fact that something renders in a browser isn't good enough: browsers have turned into overly complex monstrosities, and there's a real risk of losing further browser code bases going forward (e.g. Mozilla losing its Google deal and browser development becoming economically infeasible), at which point we're at the mercy of an ad company to even read our documents.
Tip: For people using DuckDuckGo as their default search, if you happen on a site that's no longer available, just type "!wayback " in front of the URL.
I suppose you can also set it up as a keyword search in your favourite browser.
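For example, a Firefox keyword bookmark along these lines works (the keyword is whatever you like, %s is replaced by what you type after it, and most browsers with custom search engines accept the same template):

    Location: https://web.archive.org/web/*/%s
    Keyword:  wb

Then typing "wb example.com" in the address bar jumps straight to the Wayback captures for that address.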
Unfortunate that their crawler doesn't support IPv6. Trying to save IPv6-only websites results in "Couldn't resolve host". Hopefully that'll get fixed soon and not too much will be lost...
Oh, I donated some of my money to them a few days ago. I'm very happy to read wonderful news like this :)
The outlinks feature is great! Ultra useful for blog sites.