I'm really curious about how the creators of the Wayback Machine are working to save modern (perhaps sometimes somewhat unnecessarily overcomplicated) web pages that use SPA "techniques". Have they implemented a Googlebot-like crawler that interprets JavaScript and spits out some predigested final DOM tree? Or do they record all page-initiated network traffic and just replay it, sort of? Lots of interesting research opportunities here, btw.
This is where archival meets browser/web tech, in a kinda complicated way. I would hope that people from both of these backgrounds have been working on this stuff together. If not, please start soon.
A crawler has two high level options: parse the page, or render the page.
Most of our parser-based crawling is done by Heritrix (crawler.archive.org) and most of our render-based crawling is done by a proxy-based recorder similar to what you theorize (https://github.com/internetarchive/brozzler).
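A minimal way to picture the parse-vs-render split, purely as an illustration (this is not how Heritrix or brozzler are actually built, and the binary name and URL below are placeholders):

    # Parse: fetch the raw HTML and pull out links without executing any JavaScript.
    curl -s https://example.com/ | grep -oE 'href="[^"]+"'

    # Render: let a headless browser execute the page's JavaScript and dump the
    # final DOM, which is what an SPA-heavy site actually shows its users.
    # (Binary may be chromium, chromium-browser, or google-chrome depending on the system.)
    chromium --headless --disable-gpu --dump-dom https://example.com/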
I think this is a challenge that the archive.org/Wayback Machine people need to tackle to keep general web page indexing working going forward.
To me it's kind of obvious that the "custom archivers" approach isn't going to scale to the entire web. "Custom archivers" implies some set of rules per "site engine". The problem is that there are probably (guessing here) 1M+ unique "site engines", or whatever you call them. I.e. the web is way too diverse.
There probably needs to be something generic that works with complicated pages 95% of the time, without prior knowledge of the API structure/dynamics.
If "custom archivers" is the plan for tackling the archival of the web for the past five and the next fifteen years... I really think we need to do better.
Disclaimer: I'm not directly involved in any of this, though I've followed some general discussions.
1. The Web is much less diverse than you think it is, for two reasons.
2. Zipf/power-law distributions: the largest sites are a huge part of the Web. If people are spending (say) an hour online per day, and 40 minutes of that is FB/Insta, then all the rest of the Web gets 20 minutes. And much of that will go to the next most popular sites. Attention is fundamentally limited. In terms of experience, the Web is fairly small.
3. Which of course isn't content, but here again, much of the original content only hits a relatively small number of sites. I have looked into news site stats, and what you'll see for national papers of record (NYT, WSJ, WashPo, etc.) is about 150-500 items per day (Sundays are highest). The news services (AP, UPI, AFP, Reuters) tend to run 1,000 - 5,000 items/day. User-generated content sites are obviously higher than this, though much of that content is derivative (sourced/linked elsewhere).
4. Web engines. Even with a diversity of sites, there are a limited number of systems driving them. I've been noticing news sites consolidating around a small number of standard templates, and of course FB, YT, Instagram, Reddit, etc., will get you a bunch of specific Web engines. Which means it's less "one site per archiver" and more "what sites does this archiver work on?". And worse: over what period of time -- just because an archiver works on www.example.com from 2018-1-1 to 2018-12-31 doesn't mean it works from 2019-1-1 onward.
There are a few other approaches. Archive.today (archive.is / archive.fo) basically translates pages to its own representation. There are also standardised presentation engines based on Readability (now defunct as a site, though its parser remains), used by Mozilla's Pocket and other tools (Instapaper, etc.).
5. I also suspect we may see at least some normalisation around HTML5 (or subsequent) elements. Which I'd really like to have happen. NY Times, for example, has abandoned <table> for its own custom, and horribly broken, table-presentation alternative crap.
6. Yes, this leaves one-off SPAs which may be hard to archive. I don't know what the plans are for that.
(Now to see what IA have had to say about any of this.)
Update: from an IA response in this thread, the render-based crawler is "brozzler" (https://github.com/internetarchive/brozzler).
Firefox users who use the Wayback Machine frequently may want to add the Wayback Machine add-on to their toolbar. Along with 'first', 'recent' and 'overview' selections it includes 'Save Page Now', as well as related Alexa, Whois and Twitter connections.
I see that it's licensed under the GPLv3, but where's the source?
EDIT: Maybe it's this one[1], but it's under a different license, AGPLv3. The repo also hasn't been updated since 2016, but the extension page says last update was in 2018. Are the changes and re-licensed source elsewhere?
Does the Wayback Machine have a long-term plan that anyone is familiar with? Is their goal to preserve the web indefinitely? Is the hope that storage and compression improvements over time will keep up with content creation?
And just to be clear, I think the Wayback Machine is great, and the fact that I can look up my personal, basically zero-traffic website from 15 years ago and see it is truly astonishing to me -- I'm just curious what this looks like in 10, 20, 50 years.
The best way to get this question answered would be to email or tweet at Brewster Kahle, who started and heads the Internet Archive.
If you're in San Francisco, the Internet Archive is also hosting a block party the evening of Oct 23rd from 5pm-10pm, and staff will be there to answer questions (tickets are $15).
If anyone's interested in attending our party next Wednesday but the cost is presenting a difficulty, shoot me an email (in my profile) and I'll send you a ticket at no cost.
Information is not infinitely compressible. In fact, for lossless compression, it’s proven that we can’t do much better than we already do. At some point you just need more storage. However we are several orders of magnitude away from the physical information density limit (surface of a black hole).
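For the lossless case, the limit follows from simple counting, independent of any particular algorithm: there are 2^n distinct inputs of n bits but only 2^n - 1 possible outputs shorter than n bits, so no lossless scheme can shrink every input; anything that makes some inputs smaller necessarily makes others larger.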
For complex reasons, human language (spoken and written) is about 50% redundant, across a wide range of independent languages.
Tabular data can be vastly more compressible, and I'd routinely see 90% or better compression across a range of datasets (mostly business, financial, and healthcare data). Data of highly random events might be somewhat less so.
Image, audio, and video data, when stored in modern codecs, is already highly compressed. When you're working with raw (WAV, TIFF, BMP, RAW) datatypes there's a huge opportunity for compression, but mp3, ogg, mp4, png, gif, jpg, etc., are already pretty highly compressed. There's a distinction between lossy (jpg, mp3) and lossless (png, ALAC) formats. You get smaller files with lossy formats, but you're actually losing some of the original data, whilst lossless codecs allow full reconstruction of the original source image, audio, or video.
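A quick way to see both ends of this from a shell (a rough sketch; exact sizes, and the dictionary path, will vary by system):

    # Maximally redundant input: a megabyte of zeros gzips down to ~1 KB.
    head -c 1000000 /dev/zero | gzip -c | wc -c

    # Random input: a megabyte from /dev/urandom comes out slightly larger than it went in.
    head -c 1000000 /dev/urandom | gzip -c | wc -c

    # Natural-language-ish input: a word list typically shrinks to roughly a
    # quarter to a third of its original size.
    gzip -c /usr/share/dict/words | wc -c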
Your comment about simulated universes gets to a key philosophical question about information, truth, and models. Generally, any representation we have of the universe is at best an abstraction of it, and hence a small, lossy, model.
I think statements about theoretical limits on compression are ignoring emergent properties. We know you can "compress" certain things infinitely; for example, the Mandelbrot set.
Compressing arbitrary inputs using emergent properties may never be practical, but it seems reasonable that you could trade computation for compression to an arbitrary extent (searching an emergent series for chunks of data that match your input).
No information we capture has infinite precision anyway, so fractal based compression falls under lossy compression. It is a much more complicated task to identify fundamental limits on lossy compression performance and an even harder task to have a collective agreement on "good enough" for a given purpose.
I was actually thinking more about natural languages than fractals. Maybe human thought is so utterly derivative you can just generate a random stream of words and it will contain most of the text that humanity will ever produce.
Then it can be compressed down to an index into the libraryofbabel
Save Page Now seems huge. It's a real bummer going to old forums for an obscure hobby or fandom only to find all the text and none of the images, music, etc.
I used to run an old forum that's now just an archive and I've been really meaning to download all the images linked in the posts in case they go offline. At some point during the forum's run I added a file upload feature which seems to have helped a lot (by avoiding external dependencies), but did not solve the problem. Fortunately I believe I have many of the missing images saved, but there very likely are important things missing.
I'm planning to launch a new forum next year and I think I'm going to write a script to periodically archive all images and links posted to the forum. I might disallow external images entirely, though that seems rather extreme and might just push people to post a link rather than use the upload feature.
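For what it's worth, a first cut at that kind of script can be pretty small. A rough sketch, assuming the externally linked URLs have already been dumped out of the forum database into urls.txt (one per line), and assuming Save Page Now still accepts a plain GET to https://web.archive.org/save/<url>:

    # Keep a local mirror, preserving the host/path layout so links can be
    # rewritten later if the originals disappear.
    wget --no-clobber --force-directories --input-file=urls.txt --directory-prefix=./mirror

    # Also ask the Wayback Machine to capture each URL (Save Page Now), a few at a time.
    xargs -P 4 -I{} curl -s -o /dev/null 'https://web.archive.org/save/{}' < urls.txt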
There's an incredible amount of information stored in obscure web forums, often in posts with photos. The damage that services like Photobucket have done by deleting old files, or by restricting hotlinking, has been incalculable. I worry that Imgur has the potential to do even worse damage, as so many forum users have converged on their service after others became unavailable.
(Imgur's popularity with Reddit users leaves Reddit highly vulnerable as well.)
Crazy idea: a browser extension that downloads images as the user comes across them in their browser and uploads them to something distributed - perhaps built on top of IPFS? Users could choose which domains it would be active on. The network could be split up by either domain or topic (say, people interested in diagrams of space, which might span several domains/sites).
The problems with putting that kind of data in any sort of distributed service are that:
1) It depends upon enough users being able to consistently contribute a lot of storage to the system. It turns out that this is hard. Casual users are actually a hindrance, because they'll suck up a bunch of bandwidth trying to replicate data, then drop out of the swarm forever.
2) The service will inevitably be used to host illegal pornographic content. Without some sort of centralized control, there's no way to stop this, making participation legally problematic.
Run grab-site [1] periodically (with the --no-offsite-links flag) and upload the resulting WARC files into an item in the Internet Archive. They can then be ingested by Wayback. If you prefer, I can do this for you as part of my existing archival operations.
This is trivially scripted, or there are a few existing generators.
I've created archives of ~12,000 or so posts, from an old desktop Linux system over modest residential broadband, in less than an hour, running up to 20 parallel requests via xargs or GNU parallel.
For a basic curl-based URL archiver (call once per URL on your list):
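Something along these lines should do it, assuming Save Page Now still accepts a plain GET to https://web.archive.org/save/<url> (a sketch only, adapt to taste):

    #!/bin/sh
    # save-url.sh -- submit a single URL to the Wayback Machine's Save Page Now.
    url="$1"
    curl -s -o /dev/null -w '%{http_code} %{url_effective}\n' "https://web.archive.org/save/${url}"

Run it over a list with something like "xargs -n 1 -P 20 ./save-url.sh < urls.txt" to get the parallelism mentioned above.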
Obviously they paid more for storage in the past than they would today, and it's a different kind of solution, but if you were to buy 50 petabytes today using Backblaze's Storage Pod 6.0 design it would run about $1,750,000 (roughly $0.035/GB).
And that's before ongoing maintenance costs and drive failures.
I feel the need to donate to Internet Archive soon, as I have greatly benefited from it in the past and am sure to in the future too!
As much as I appreciate the Wayback Machine, it's the responsibility of authors to choose an authoring format that can stand the test of time, at least for content you care about. HTML is built on a rich foundation of markup languages and is more than adequate for preservation. The mere fact that something renders in a browser isn't good enough: browsers have turned into overly complex monstrosities, and there's a real risk of losing further browser code bases going forward (e.g. Mozilla losing its Google deal and browser development becoming economically infeasible), at which point we're at the mercy of an ad company to even read our documents.
Tip: For people using DuckDuckGo as their default search, if you happen on a site that's no longer available, just type "!wayback " in front of the URL.
I suppose you can also set it up as a keyword search in your favourite browser.
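For example, a Firefox keyword bookmark along these lines works (the keyword is whatever you like, %s is replaced by what you type after it, and most browsers with custom search engines accept the same template):

    Location: https://web.archive.org/web/*/%s
    Keyword:  wb

Then typing "wb example.com" in the address bar jumps straight to the Wayback captures for that address.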
Unfortunate that their crawler doesn't support IPv6. Trying to save IPv6-only websites results in "Couldn't resolve host". Hopefully that'll get fixed soon and not too much will be lost...
Oh, I donated some of my money to them a few days ago. I'm very happy to read wonderful news like this :)
The outlinks feature is great! Ultra useful for blog sites.