Archivists are racing to digitise 300 years of newspapers before they crumble (theguardian.com)
89 points by samclemens on July 5, 2015 | hide | past | favorite | 39 comments



It's interesting that people are currently lamenting the fact that a huge chunk of human history will be lost because it's all digital now, while at the same time we are racing to digitize the past to prevent it from getting lost.

It almost seems inevitable that most of the details of human history will be lost regardless....


I don't mean to self-promote, but I've thought a lot about this too and created http://tracket.com as a side project. I haven't actually talked about it publicly until right now.

It's sort of a mix between archive.org and google news. It aggregates and archives articles on the web for current events each day. I've been archiving the top stories each day since December 2014.

Part of my interest in this is to preserve human history in the digital age. We already have the means of storing articles, photos, videos and more of current events as they happen. It's a work in progress, but it's something I spend a couple of hours on each week, hoping to improve it over time.

Edit: Take a look at the top stories in January 2015 at http://tracket.com/2015/01


The key advantage with digital is that it's easily copied.

For that advantage to be worth anything, the material has to actually be...copied.

Central repositories for this stuff are fine, maybe even essential. But they need to also be spreading around full copies quite liberally.


> For that advantage to be worth anything, the material has to actually be...copied.

This is the key. I think people are afraid of our all-digital world because no one backs anything up.


Also, digital is so efficient it accumulates far faster than older media. I find it a very telling pattern. Things are easy and so they get out of control.

When I was a kid, a photo album had something like 50 tiny fading photos, and it covered 5 years or so. Each one was a very high density of memories. Now we have thousands of them. Curious.


In the old days someone put a photo album in a trunk in the attic. Then they got old, forgetful, the kids didn't care about the photos, then the person died. Years later someone else found the photo album.

What happens in this scenario with digital photos? Sure, if someone is careful (and most people frankly aren't... I've seen more than a few people lose all their photos) you can keep backing them up. Until you can't anymore or you don't. Then they are gone.


Except for the things you don't want anyone to ever see; those will propagate mercilessly.


>The key advantage with digital is that it's easily copied.

That and the fact that we can store an entire library of data in a searchable space the size of an ice cube.


The fact that digital media is so easily copied is what immensely scares the copyright/DRM advocates, so let's hope they don't get in the way of preserving history.


Particularly text. Once OCR-ed, the content of the entire building fits on a consumer HD.


I spent a few years at the Library of Congress working on a similar project.

While it's "easier" to lose digital, it's a possibility as opposed to the inevitability of losing physical forms like paper, LPs, or wire spool recordings. Physical degrades, there's no (economically feasible) way around it.

In the case of digital forms of the same media, they're stored in high-resolution, lossless formats for archival purposes and then downsampled for day-to-day use. For example, our image scans were 300dpi TIFFs while the day-to-day versions were simple PNGs. Audio files had equivalent treatment.

The vision is that the lossless versions can be updated to the latest and greatest formats over time and the smaller versions can be recreated as needed.


It takes time to make paper, it takes time to make ink. It takes time to make NAND gates and flash storage. It takes time to write things down, it takes time to impose a pattern of electrostatic charge onto a substrate. Call this t_make.

It takes time for paper and ink to decay. It takes time for charges to leak. It takes time for PGA packages to deteriorate. Call this t_decay.

In general, t_make << t_decay even for paper, and with electronic information the gap is even more profound. A useful figure of merit is the ratio t_make/t_decay, which in this case is roughly 1/300 (assuming this project takes a year, which is probably high). For electronic backup, I'd imagine it's more like 1/10^6 or so.

That said, some human effort is needed to keep information alive over a long period of time. But it's only t_make effort every t_decay period, which actually isn't a lot, on its face. It's an interesting question though what the maximum amount of data humanity can keep alive given a certain level of economic output. E.g. if you make 10^7 hard drives a year that store 10^12 bits each and they each last 10 years, then you can store 10^20 bits indefinitely (10^7 hard drives/year x 10 years x 10^12 bits).
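The steady-state arithmetic above can be checked with a quick sketch. The figures are the comment's assumed round numbers, not measured values:

```python
# Back-of-envelope: how much data can be kept alive indefinitely if
# storage media are continuously manufactured and retired?
drives_per_year = 10**7       # hard drives manufactured annually (assumption)
bits_per_drive = 10**12       # capacity of each drive (assumption)
drive_lifetime_years = 10     # service life before the drive dies (assumption)

# At steady state, drives alive at any moment = production rate * lifetime,
# so the sustainable archive size is:
sustainable_bits = drives_per_year * drive_lifetime_years * bits_per_drive
print(sustainable_bits)  # 10^20 bits kept alive indefinitely
```

The same formula gives the t_make-per-t_decay maintenance cost: each year you must re-copy 1/lifetime of the archive onto fresh media.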


1) purposeful digital archival and day-to-day throwaway content generation are very different stories in terms of staying power

2) the loss of print media is happening right now, today, while the loss of digital media is not quite as pressing. "Kicking the can" in a sense.


Here's my take, perhaps a little far-fetched.

At some point around 5000 years ago writing emerged. People wrote down things like “such and such is the king of all the world. He is the son of such and such.” If these things survived and we have them today, we call that history. There’s a convention of dating the beginning of history to the first writing and calling everything before that prehistory. The specifics are cloudy and debatable. Does it count if we can’t read it? Does it count if it’s not classifiable as “writing”? If people paint a cave and we understand they were talking about hunting aurochses with spears, then we know there were some people here that hunted aurochses with spears. Does that count as writing?

Vague semantic lines aside, there’s an idea here that’s interesting. When people can write about something as it happens and put that information down, that’s history. The historical age. There’s something to that, an important difference between information communicated to us by people of that time rather than us knowing by studying less direct artefacts.

All this writing, all this history, has been accelerating like most technology. Every century (except for a few bumpy patches) we’ve been making more of it, making more history. More places, people, perspectives, details. We’ve added pictures (if cave paintings don’t count, are paparazzi snaps of Kardashian’s arse out too?) and video, 3d images, seismographic data and a whole lot of other stuff to that pile. History is exploding. Terry’s History Monks must be having fits.

Anyway, say we move forward a bit and say the current trends keep going for that bit. CCTVs are now omnidirectional. The cops finally agreed to wear body cams, but insist that the crooks wear them too, and the politicians while we’re at it. Computers can rewind every bit that ever flipped. Since everyone has an ear-chip and a groin-chip installed (Google Glass was the last attempt at ocular recording devices, too invasive), our lives are recorded fully. Psychology finally works because instead of asking about your mother, a guy with a 19th century bohemian accent just becomes you as a suckling babe for a bit and figures out what went wrong. Maybe he reads her old tweets.

I think that’s a new age. I know this gets us into naming convention problems when the next age comes, but I’d like to call it posthistory. The period before people recorded their commentary on events, the period when they did and the period when everything was recorded in full.

I actually do think that’s coming and I do think it will be a profound change. I think we’re in a sort of grey transitional period where not everything is recorded, but sufficiently more than in previous times that we’re back to troubling questions “do symbolic representation systems that only have symbols for tax related concepts like cheese, beer and god-kings count as real writing and therefore history,” but with a more posthistoric flavour and copyright issues. If Maldu of unya

Oh, and I think bit rot will get solved, unless zombies or meteors or something. Shuruppak sent a note to Labek telling him not to screw around with married women 5000 years ago and we can still read it. I think we can preserve “LOL UR mom is crazy!!!!!”


> Maybe he reads her old tweets.

http://www.cnn.com/2013/01/07/tech/social-media/library-cong...

Tweets are actually quite small and very compressible, so archiving them all wouldn't take much space; despite Twitter averaging several thousand tweets/s, from the perspective of a machine that's less than 1MB/s and far less than that compressed.
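The bandwidth claim is easy to sanity-check. The rates and sizes below are rough assumptions for illustration, not Twitter's published figures:

```python
# Back-of-envelope: bandwidth needed to archive the full tweet firehose.
tweets_per_second = 5000   # "several thousand tweets/s" (assumption)
avg_tweet_bytes = 200      # ~140 chars of text plus minimal metadata (assumption)

raw_rate = tweets_per_second * avg_tweet_bytes  # bytes per second, uncompressed
print(raw_rate)  # 1_000_000 B/s, i.e. about 1 MB/s before compression
```

Short, highly repetitive text like this typically compresses several-fold, so the sustained archival rate would be well under a megabyte per second.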


There's an episode of "Black Mirror" that speaks to your "posthistory" idea and it's not pretty. https://en.m.wikipedia.org/wiki/The_Entire_History_of_You


Well, I think that is inherent in entropy.

If I understand the physics right, any kind of order/information will eventually decay into chaos, with the ultimate end being the heat (gradient) death of the universe.


The second law of thermodynamics says that the total entropy of an isolated system never decreases. It doesn't prohibit local entropy decreases. So a part of a system (newspaper, archive, planet) can maintain or increase its order.


Note to sibling commenter, contingencies: your comment is dead for whatever reason, so nobody can reply to it. Might want to check with the HN management to see why.


A major Australian/NZ newspaper company shipped all their photos to the US for digitization. The company doing the digitization (Rogers Photo Archive) then went bankrupt.

http://www.theguardian.com/media/2015/jun/08/fairfax-media-p...

Especially embarrassing for the NZ branch, since they got special permission from the government to ship the photos out of the country. Some of the photos ended up on eBay.


Out of curiosity: a huge number of newspapers are already archived on microfilm (unless it's all been thrown away). I'm sure digitizing from microfilm would be a reasonable alternative.


It's mentioned in the article that the microfilm is digitised as well.


Hm. Yet to be proven that any digital storage will last anywhere close to that long. At best it will be a continually active process of refreshing the archive and converting the data to whatever storage media and formats are currently in use.


That's not a problem. The problem is with analog media (paper), which will always disintegrate over time. Digital media will not lose fidelity over time at all. In 10,000 years, as long as the backups were maintained, the data will be as good as the day it was made.


If the backups aren't maintained, the data will be instantly, totally, and irreversibly lost. The maintenance on paper involves not leaving it in the rain or setting it on fire; both failure modes shared with digital.


And dealing with the fact that paper is typically destructive and hard to copy.


Moving the scans to new digital media and data formats is the hard part. It takes effort and money, whereas analog formats need no extra work.


This is only partially true: with online storage, it's relatively easy to make bit-for-bit identical copies even if the physical storage medium changes over the years because there's always an overlap period when a technology falls out of favor.

Where it gets expensive is when you neglect to do that and then 50 years from now someone is pulling a Zip disk or LTO tape out of a box and wondering how to read it.

In contrast, analog formats will always lose quality as you copy it so you have a strong incentive to make copies which will last as long as possible. If you get the right material it might be transferable in the future with no work – e.g. high-quality photographic prints on archival-quality stock – or you might end up needing to build exotic equipment which can do things like optically scan records to reconstruct an audio waveform (http://irene.lbl.gov) or deal with media which has disintegrated (https://www.nedcc.org/audio-preservation/irene-blog/2014/08/...). One look through e.g. http://britishlibrary.typepad.co.uk/collectioncare/index.htm... should be enough to see limited a time period “no extra work” is valid for.

The common theme for both formats is that it's critical to maintain the ability to read and make copies. Once something falls out of common usage, the cost to rebuild that capacity goes up dramatically because you're no longer enjoying mainstream economies of scale and the work will increasingly require skilled technicians using bespoke tools.

This can be particularly bad with digital formats if the use of DRM means that few/no people are legally allowed to create tools during the period where many of the original creators are still available for consultation.


Paying the rent on a building takes extra work. Keeping the roof in good repair so it doesn't leak takes extra work. Maintaining an atmosphere of 14% oxygen takes extra work. Just because we call it maintaining existing stuff instead of making new stuff doesn't mean it's not needed work.


Digital storage probably won't last that long, but if you store it right and keep rotating media (possibly upgrading to better ones as they become available) then the data itself can last as long as someone is putting time into caring for it.


Is there archive storage management software for offline media testing, recovery and rotation? I've seen dvdisaster and git-annex.


Fresh paper archives could be maintained in parallel with digital.


The idea of concentrating all these one-of-a-kind newspapers into one building is crazy. What if it burns down?

> At such low oxygen levels, the contents simply can’t go up in flames.

Famous last words.

> And with standards for the documentation, archiving and accessing of data – official and personal – still being thrashed out,

I don't understand why this is a problem. Scan them to pdf files, and put them on web pages. Let google index them.


It's a basic principle of fire safety engineering – paper requires a concentration of more than 14.1% oxygen to allow combustion to occur.

Considering that paper composes the majority of the mass in that installation, a sustained hypoxic environment at 14% or below is exactly how this system should be designed.

If the papers were separated into separate warehouses, they would still all have the same environmental requirements. Additionally, you require N times more budget, where N is the number of warehouses you've constructed (not to mention the difficulty in querying physically distributed warehouses for information).

> I don't understand why this is a problem. Scan them to pdf files, and put them on web pages. Let google index them.

Before making such broad, sweeping statements, perhaps read up a bit on the principles of information science: https://en.wikipedia.org/wiki/Information_science

Edit: You are editing your comment every few minutes, so I don't know what to reply to anymore.


Ships have hulls to keep the water out, too, but sometimes the water gets in and they sink. I can think of dozens of ways the archive could still burn.

Here's just one: large earthquake breaks open the building, cuts electric power, breaks gas lines, fire starts. Firemen are overwhelmed and give priority to saving civilians in other buildings, stacks of newspapers are at the bottom of their list.

Another: Fire starts in building next door. Wind whipped flames set the archive on fire from the outside. Archive burns down with everything in it.

A third: a bunch of militants take over that part of town and set fire to the archive because they are opposed to history. Not like that has never happened before: ISIS-controlled Iraq, or the burning of the Library of Alexandria.

It's the classic eggs-in-one-basket scenario.


At least it's a good basket, in a reasonable place, with safeguards in place. Two or three 9s more likely to work than doing nothing.


> Ships have hulls to keep the water out, too, but sometimes the water gets in and they sink. I can think of dozens of ways the archive could still burn.

Sure, but we still build most ships with double hulls (being extra safe is not a bad thing), and when transporting important goods we generally don't split them up among several cargo ships just in case one hits an iceberg.


>> I don't understand why this is a problem. Scan them to pdf files, and put them on web pages. Let google index them.

> Before making such broad, sweeping statements, perhaps read up a bit on the principles of information science: https://en.wikipedia.org/wiki/Information_science

By putting the data in a standard form, like pdf, and making it available to anyone, i.e. put it on web pages, then any organization can devise an organizational and retrieval methodology and apply it. It is completely unnecessary for the archivists to do so, and it should not be blocking them in any way.

Google is just one such example.


I have some newspapers that my grandmother saved from World War II. Those were quite unlike today's newspapers.



