We're drowning in data in the present. But I predict that five hundred years fro...

jzb · on Oct 27, 2023

I'm generally pessimistic but I don't think this is true. We're still generating print information at an astounding rate. Plenty of organizations work on archiving information in various forms.

Historians will have plenty of information to sift through from this era in 500 years. Will they have a complete collection of ACM Journals? Perhaps not. Will they have ample information to get a clear picture of society from this era, and clear timelines of events, etc.? I would say yes, better than any other time so far.

I think people conflate "a lot of information won't survive in 500 years" with "we're going to lose everything."

I have literally thousands of pictures on my phone / in cloud storage. Odds are none of them will survive 500 years. That's OK, from a historical perspective -- 95% of them are cat pictures or memes anyway. 5% might document something interesting if it was all a historian had to puzzle together a picture of what life was like in the 2020s.

But it won't be, because we're literally producing trillions of digital artifacts. If even 1% (or probably even 0.1% or 0.01%) of those survive, historians will have a richer visual representation of the 2020s than we have of other times in history.

(Whether we'll have any historians or humans in 500 years' time, that's the real question.)

jacquesm · on Oct 27, 2023

I'm not saying that we are not generating print information. I'm saying that our signal-to-noise ratio is very low. Even if a lot of information survives, the chances that the 'good stuff' will be preserved are getting smaller by the day, to the point that it will essentially be drowned out by junk unless we take special measures in the present.

rbanffy · on Oct 27, 2023

> But you'll have a million AI generated recipes for Apple Pie from spam websites of the era to console you.

And good luck finding the relevant information in the middle of all that noise.

titzer · on Oct 27, 2023

This is exactly the problem; information isn't necessarily being lost, but silted over and sometimes intentionally buried. Part of it is due to its natural loss of relevance, part of it is due to loss of popularity and attention, and part of it is deliberate commercial motive.

bdw5204 · on Oct 27, 2023

Information is being lost constantly on the internet. Whether it is CNET deleting old articles as an SEO tactic[0], domain names expiring, formerly popular sites like Geocities being erased altogether, Google mismanaging its Usenet archive, or once-popular blogs getting deleted for TOS or account inactivity issues, the internet is certainly not forever. Archive.org isn't really a solution either because it is not uncommon for domain squatters to use a robots.txt setting to get them to remove the domain from the Wayback Machine. You can't even rely on large social media platforms: people delete their accounts, some people auto-delete their old posts, and platforms decide to login-wall themselves, as happened with Twitter.

Link rot is a major problem that people don't recognize, especially for information that was only ever online. Most of the obscure web sites I used to read and hang out on are gone and many of the things I remember are now completely unverifiable because I didn't save a copy of every web site that ever influenced me.

My own unfinished game project from my teenage years vanished from the internet without a trace after I lost interest and I lost all of the code along with all of my other data from my teen years in a hard drive crash around the time I finished high school. The mods I made for games and never distributed are sitting on old laptops in my closet that I haven't turned on in years and that may or may not even work anymore. I imagine everybody else who's been heavily online in the past has similar stories of just how ephemeral digital information is.

[0]: https://www.theverge.com/2023/8/9/23826342/cnet-content-prun...

Intralexical · on Oct 27, 2023

> Archive.org isn't really a solution either because it is not uncommon for domain squatters to use a robots.txt setting to get them to remove the domain from the Wayback Machine.

Do they delete it? My understanding is that they simply unpublish it. Lost from the Internet, then, but not necessarily forever.

> and I lost all of the code along with all of my other data from my teen years in a hard drive crash around the time I finished high school.

Technically that data wasn't lost for good either with the hard drive crash. Provided there's an academic, personal, economic, cultural, etc. incentive to read it, I'm sure any old inflation-adjusted $50 magnetic microscope from the year 2080 would have been able to get it all back in a matter of moments.

Overall, I agree with your point. LOCKSS (the principle, not the project) and KISS, and checksums and ECC, etc. HD-Rosetta/NanoRosetta's cool but doesn't seem super scalable or readable, MDisc was exciting but was also a market flop, and Memory of Mankind's ceramic tablets and the Arch Mission Foundation's glass hologram thingies have even bigger practicality problems. For now, so long as digital storage availability increases exponentially, you can probably just spin up Borg or something and keep accumulating backups of old files indefinitely.

But overall, anything that you don't actively invest the overhead to save can be assumed to be lost.
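For anyone who wants to try the checksum part of that, here is a minimal sketch in Python, not anyone's actual tooling. It records a SHA-256 digest for every file in a folder and later reports files that have gone missing or changed. The function names (`sha256_of`, `build_manifest`, `find_damage`, `demo`) and the sample file names are made up for illustration. A checksum only detects damage; repairing it still takes a second copy or ECC.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_damage(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents no longer match."""
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel)
    return damaged


def demo() -> tuple[list[str], list[str]]:
    """Hypothetical walkthrough: snapshot a folder, corrupt one file, re-check."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "game.c").write_text("int main(void) { return 0; }\n")
        (root / "notes.txt").write_text("mods live on the closet laptop\n")
        manifest = build_manifest(root)
        before = find_damage(root, manifest)  # nothing has changed yet
        # Simulate bit rot by altering one character in one file.
        (root / "notes.txt").write_text("mods live on the closet lapt0p\n")
        after = find_damage(root, manifest)
        return before, after


if __name__ == "__main__":
    print(demo())
```

In practice you would store the manifest alongside each backup copy and run the check on a schedule, so a bad copy gets replaced from a good one before every copy has rotted.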

rbanffy · on Oct 27, 2023

> information isn't necessarily being lost, but silted over and sometimes intentionally buried.

And that kind of touches on disinformation campaigns. There is a lot of noise being deliberately and maliciously added so that it drowns out any information someone wants to suppress. An AI model trained on this corpus will have all the wrong ideas.