From the first two paragraphs: "Supreme Court opinions have come down with a bad case of link rot. According to a new study[1], 49 percent of the hyperlinks in Supreme Court decisions no longer work.
This can sometimes be amusing. A link in a 2011 Supreme Court opinion[2] about violent video games by Justice Samuel A. Alito Jr. now leads to a mischievous error message[3]."
I love the new line on the error page: "And if you quoted this in the NY Times, will you do a correction for the now changed text?" That wasn't there when the NYT story broke, obviously. Very clever.
I publish a print magazine. Up to this point, we've avoided printing URLs in the magazine, because our printed issues may last longer than the links will stay live.
To solve the problem, we're launching our own link shortener this month, and printing our own shortened links in the magazine. That way we can control what happens when link rot sets in, whether redirecting or caching content on our own servers if necessary (and allowed by copyright).
The side benefit is that all our printed links can be easy to read and type, an important usability concern when dealing with print. A little self-referential branding on the shortener doesn't hurt, either.
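For what it's worth, here is a minimal sketch of the fallback behavior described above, assuming a Flask app and a simple in-memory table; every name here (the LINKS table, the snapshot paths, the "alive" flag) is hypothetical rather than an actual implementation:

    # Minimal sketch: redirect while the target is alive, serve a locally
    # cached snapshot once link rot sets in.
    from flask import Flask, abort, redirect, send_file

    app = Flask(__name__)

    # Hypothetical mapping; in practice this would live in a database and the
    # "alive" flag would be maintained by a periodic link checker.
    LINKS = {
        "a1": {
            "url": "http://example.com/some/article",
            "snapshot": "snapshots/a1.html",  # archived copy, if copyright allows
            "alive": True,
        },
    }

    @app.route("/<code>")
    def resolve(code):
        entry = LINKS.get(code)
        if entry is None:
            abort(404)
        if entry["alive"]:
            return redirect(entry["url"], code=302)  # normal case: pass through
        if entry["snapshot"]:
            return send_file(entry["snapshot"])      # link rot: serve our copy
        abort(410)                                   # gone, and nothing cached

Whether you may serve the snapshot at all is, as noted above, a copyright question rather than a technical one.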
Except that your shortener service won't last as long as your print issues either. With the original URL there's at least the hope that the Wayback Machine is still maintained and has visited the page. It's a hard problem.
I think this is a great idea. The casual reader will type in the short URL (giving the publisher tracking and control), but the expansion will be available as long as the print issue survives. This still fails if only an article survives (by clipping, photocopy, etc.), but that seems like an acceptable tradeoff.
And until it's a common idea it would be helpful to annotate such shortened links so that someone unfamiliar with the publisher will recognize them as having a printed expansion included. Superscript "*" usually refers to a note on that page, so perhaps a superscript ">>"?
Exactly, or even just issue a yearly book containing the entire mapping. That would save printing, since most people probably don't currently care about the expanded form, though in the future they would still be able to look it up in the library system. (They could also start doing this now, even though they can't retroactively add the expanded forms to editions they have already printed.)
Does any existing publication do this? Because providing both shortened and full URLs like this sounds like a great (i.e., retrospectively obvious) way to achieve a balance between "UX" of the magazine and longevity.
Yes. CrossRef.org, which is the registration agency for DOIs (digital object identifiers), does this for URLs across basically the entire scholarly publishing industry. It combats link rot by ensuring that the metadata database is always kept up to date, along with other initiatives.
Probably best to list both the original URL and some kind of link to an archived version (either your own or from the Wayback Machine), and of course include the access date as well. With the date and the original URL, you can always use the Wayback Machine or another archival service like WebCite to attempt to find an archived copy.
No citation is guaranteed to be readable after a certain amount of time. A citation is merely a reference identifying the resource you're looking for; you still need to find it yourself in a library or from the publisher. But the publisher may no longer exist, you may not be able to find a library with a copy, and so on. You can only do so much.
But if you cite the URL and date, plus at least one alternative way to access the content, like the Wayback Machine, WebCite, or your own archive, you give people a pretty good chance.
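As a concrete illustration, the Wayback Machine exposes a public availability endpoint that does exactly this lookup; a rough sketch (the example URL and date are just placeholders):

    # Ask the Wayback Machine for the snapshot closest to a cited date.
    import requests

    def closest_snapshot(url, date_yyyymmdd):
        resp = requests.get(
            "https://archive.org/wayback/available",
            params={"url": url, "timestamp": date_yyyymmdd},
            timeout=30,
        )
        resp.raise_for_status()
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"], closest["timestamp"]
        return None, None

    # e.g. closest_snapshot("http://ssnat.com/", "20110627")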
Perhaps there needs to be a peer-to-peer web citation database that publishers, libraries, and individuals can join. When a book or journal contains a web citation, the publisher archives a copy of that page in their database. When a library or individual buys a book, they can pull the archived copies from the publishers. This will ensure that as long as someone out there who has a copy of the book has also maintained their web citation database and is on the network, you can pull the citations from them.
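To make that a bit more concrete, here is a very rough sketch of the kind of record such a network might pass around; every field and name here is hypothetical:

    # Hypothetical record exchanged by a peer-to-peer citation archive.
    from dataclasses import dataclass

    @dataclass
    class CitationRecord:
        isbn: str          # the book or journal issue that cites the page
        url: str           # the cited URL
        fetched_at: str    # ISO date on which the publisher archived it
        sha256: str        # hash of the archived bytes, so peers can cross-check
        archive_path: str  # where this node stores its copy

    # A library or reader joining the network would request every record whose
    # isbn matches a book on their shelf, fetch copies from peers, and verify
    # each one against its sha256 before storing it locally.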
I believe that archiving of publicly available material, as archive.org does and Google's cache does, is considered not to be copyright infringement (as long as it obeys the limits in robots.txt and noarchive meta tags), as it is sufficiently transformative. So this shouldn't be a problem unless someone explicitly requested that the page not be archived, in which case you probably shouldn't rely on it for a citation.
They could easily run a bot that checks each URL in the shortener database for 404s or substantially changed content and informs the shortener. When that happens, the shortener could send the user to a stored snapshot or cached version of the page that they host, with a short note explaining that the content of the original link appears to be gone.
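A sketch of what that checker might look like, assuming naive byte-level hashing (a real bot would compare extracted text, since ads and timestamps change on every fetch):

    # Flag links that have died or whose content has drifted since last seen.
    import hashlib
    import requests

    def check_link(url, last_hash):
        """Return (status, new_hash); status is 'ok', 'changed', or 'dead'."""
        try:
            resp = requests.get(url, timeout=30)
        except requests.RequestException:
            return "dead", last_hash
        if resp.status_code >= 400:
            return "dead", last_hash
        new_hash = hashlib.sha256(resp.content).hexdigest()
        if last_hash is not None and new_hash != last_hash:
            return "changed", new_hash
        return "ok", new_hash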
I suggest you take a look at the perma.cc initiative. There's no reason to reinvent the robust shortlink, and you won't be obligated to run this system forever.
"Once a link is vested, it will never be deleted unless we receive a legitimate legal order to take it down."
That seems odd to me. Could they simply publish their mappings periodically to avoid being able to fully comply with takedowns?
I mean, the article using the link is already published, what legitimate reason could there ever be to destroy a link in it? What would these people issuing the hypothetical takedown do if the author of the paper had just included the link itself?
Perma.cc appears to be targeted at links in published journals:
After the paper has been submitted to a journal, the journal staff checks that the provided Perma.cc link actually represents the cited material. If it does, the staff “vests” the link and it is forever preserved. Links that are not “vested” will be preserved for two years, at which point the author will have the option to renew the link for another two years.
I'm sure they'll be willing to work with other publishers as well. From talking to the project manager, they don't seem to have any goal of restricting this to academic publications.
From what I understand, all the member libraries are storing some part of the cache. They might not all have a complete copy, but the details aren't clear yet.
> we're launching our own link shortener this month, and printing our own shortened links in the magazine. That way we can control what happens when link rot sets in, whether redirecting or caching content on our own servers if necessary (and allowed by copyright).
That seems worse to me. Sometimes you can gain some information from just the hostname and path in the URL. Now all you'll see is random numbers.
And are you really going to spend the money to check every URL you have ever published? You have to do it by hand, you know. A tool can help, but it won't tell you whether the content of the page has changed.
If you want to do this properly, use global footnote numbering in each issue. Publish the real URL and the page title (very important, since you can Google for those words). Each URL gets a footnote number that is unique within that issue, and you print the number near the URL.
Readers can then look up a URL by the magazine issue and footnote number.
I remember PC World started doing this back in like 2003. Makes sense, I think. But how are you solving the problem any better than a reader can armed with (1) the date on your cover, (2) the original URL, and (3) the Internet Archive Wayback Machine? Are you aggressively caching every page, even if you don’t make the cache public for a number of years?
We did that at the Folha de S.Paulo newspaper in Brazil in the late '90s, but mostly to link to the online version of the site the newspaper owned. It was a way to give out more data for a news piece that could not fit in the paper. External URLs were usually printed in the paper anyway.
Do that automatically and you'll wind up with a lot of broken URLs. Routinely, when fixing broken links on my own site, I punch something into the Internet Archive, click on the latest version, and get redirected to the root of some site or other. Not very useful. (The redirect also makes it hard to get back to the URL search results, because the IA redirects you so fast that if you simply hit the back button, you won't go anywhere.)
What's the additional cost to maintain something like this long-term?
I'm just thinking about articles that could be 10-15 years old (10-15 years from now, mind you); you could have a substantial database of links to maintain and keep updated, which is not a small task.
I get an empty download file… is that expected? (It must be a Wayback Machine glitch, maybe caused by the surge of popularity, I figure. It was supposed to have content about school shooters.)
I'm more worried that Justice Alito is apparently unaware of the phenomenon of vaporware:
"There are games in which a player can take on the identity and reenact the killings carried out by the perpetrators of the murders at Columbine High School and Virginia Tech."
But both the news article and the now-defunct website he cites clearly state that the game doesn't exist.
There was at least one (probably more) paltry DOOM map mod depicting Columbine High School. Not a finished product, and very poorly done, but enough to fulfill Alito's claim.
In the spirit of the reference, there was a high-quality simulation of the JFK assassination, with the user acting as Oswald. Made a big splash of outrage in the news and was soon pulled (the contest to reenact "the shot" didn't help). It's still available if you look.
BUT... keep in mind the endless movies and books depicting gratuitous violence. Alito hasn't articulated a difference on "freedom of speech" grounds.
For people positing hypothetical solutions to this problem, consider this exercise. Look at this case: http://scholar.google.com/scholar_case?case=1254092943969513.... This is a Supreme Court case from 1899. There is a citation to a Sixth Circuit case on page 238 (after "'reserved' cities") that has disappeared in the Google Scholar copy, but appears in Westlaw as: 54 U.S. App. 723, 85 Fed. Rep. 271, 29 C.C.A. 141, 46 L.R.A. 122. Westlaw still happily pulls up this 114 year-old citation to the Federal Reporter. That's the sort of time scale legal documents need to operate on. I'm not really convinced anything on the internet as we know it today can offer permanence comparable to printing out a bunch of copies and shipping them around the country.
The industry standard for citations between scholarly publications is CrossRef, which is the official DOI link registration agency for scholarly and professional publications. I don't know what the original document said, but if it was a scholarly publication, the citation should/could have been done by DOI. DOIs resolve to URLs, but publishers have a mandate to keep them up-to-date.
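For illustration, resolving a DOI and pulling its CrossRef metadata is a couple of HTTP calls; the DOI in the usage comment is just a placeholder:

    # Resolve a DOI to its current landing URL and fetch basic metadata.
    import requests

    def resolve_doi(doi):
        # doi.org answers with an HTTP redirect to wherever the publisher
        # currently hosts the work; following it yields the live URL.
        resp = requests.get(f"https://doi.org/{doi}", allow_redirects=True, timeout=30)
        return resp.url

    def crossref_metadata(doi):
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
        resp.raise_for_status()
        return resp.json()["message"]  # title, authors, container journal, etc.

    # e.g. resolve_doi("10.1000/xyz123")  # placeholder DOI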
So, what's the solution, here? It seems reasonable for a court opinion to include a copy of the source, if possible. What about for something like a YouTube video, which could disappear at any time, but can't be represented on paper? How would you agree on a digital format for representing a supreme court opinion?
The Administrative Office of the U.S. Courts has recommended, and many (most? all?) of the federal circuits now have an official policy of, saving web page links to PDF and filing them along with opinions. The Supreme Court tends to do its own thing, but I think that's a simple effective solution.
This is our policy for student papers of all sorts as well. All cited weblinks must be submitted in PDF form as well. Pretty sure that's super standard practice.
You are citing a source. MLA and other schemes require you to give an access date, and the only way to definitively prove you saw content on a given date is to provide the content you saw. Web servers don't magically give users a way to go back in time.
Whether PDF is an appropriate format is a different story.
> the only way to definitively prove you saw content on a given date is to provide the content you saw.
Providing a copy of the content that you claim is the source does not come anywhere close to definitively proving that that content is the content at the cited source on the identified date.
All it does is provide what you claim to be the original source, which, assuming you do it honestly, provides the content backing the characterizations for which you cited it, and the context for any excerpts you quoted. So it's useful, but not as proof that you saw the content at the cited source on that date.
This was submitted to HN some time ago. It automatically pulls the website and creates a certificate of content and date that is cryptographically signed.
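In spirit, such a service does something like the following sketch; the key handling here is ad hoc and purely illustrative, assuming the requests and cryptography packages:

    # Fetch a page, hash it, and sign the hash plus a timestamp, so the
    # snapshot's content and retrieval date can be verified later.
    import hashlib, json, time
    import requests
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def certify(url, private_key):
        body = requests.get(url, timeout=30).content
        record = {
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),
            "fetched_at": int(time.time()),
        }
        payload = json.dumps(record, sort_keys=True).encode()
        return record, private_key.sign(payload)

    key = Ed25519PrivateKey.generate()
    record, signature = certify("http://example.com/", key)
    # Anyone holding the public key, the record, and the original bytes can
    # recompute the hash and verify the signature.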
Build awareness of the importance of the permanence of URIs and what they mean to Hypertext. The only full solution is a cultural one, not a technological one.
Culture isn't going to encourage people to host websites they're no longer interested in, where by "host websites" read "pay money for".
The only solution is some form of legalized archiving. We need the right to copy for archiving without profit, or something. Not sketching out a full, legal solution here, just pointing out that it has to include some form of right-to-archive.
Archive.org, as cool as it is, does not archive the entirety of the internet. Moreover, it's a bit silly to posit that one website (which we can surely rely on to exist forever!) is the solution to the fact that you can never rely on any particular website existing in the future.
No, I'm not saying that "the internet archive completely solves this problem", what I'm saying is "there exists no legal barrier to making something like this a reality, since things like the internet archive already exist".
While I agree that there's a necessary cultural aspect here, it does not seem sufficient in this context. If the highest court in the land is citing a source in a decision, then that source needs to be available, by technological solution, if necessary, regardless of external, cultural forces.
Sure, sure. That's comparable to how to fix communism. All we need to do is change the way human beings work, change fundamental economic incentives, and everything will be golden.
It's just not possible. There's only one way to fix these problems, and it's not something most people are ready to accept. It's to turn the internet inside out, from a client-server model to a store-and-forward model. It's to treat web content more like ebooks than like fliers posted to a cork board.
There are upsides, like decentralization, and downsides, like everybody knowing exactly what data you're asking for if it's strictly addressed by hash (no way of obscuring your political activities from an oppressive regime by using an unrelated name for common data).
I wonder if archive.org would provide a paid, authenticated (notarized) snapshot service on demand (i.e., by request and after payment, create a copy and guarantee it's never changed or deleted). Maybe they could even make some money out of it.
Isn't the answer to this a DHT? A well-chosen hash algorithm should give you suitable protection against collisions and serve as a way to verify the contents of the retrieved file.
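A minimal sketch of the content-addressing half of that idea; the DHT lookup itself is omitted:

    # The "address" of a page is the hash of its bytes, so any node serving
    # it back can be checked without trusting the node.
    import hashlib

    def content_address(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    def verify(address: str, data: bytes) -> bool:
        # A well-chosen hash makes collisions impractical, so a match means
        # the retrieved bytes really are the cited content.
        return content_address(data) == address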
The point "about the transience of linked information" has been made. But the problem isn't new to anyone, including NYT. It should pointed out that Justice Alito included a date with the URL, which allows the URL to function as a citation even after the content changes. It's no different than citing an unpublished source.
In fact the actual court decision quoted has everything you would want to know about what was on the web page:
14 Webley, “School Shooter” Video Game to Reenact Columbine, Virginia Tech Killings, Time (Apr. 20, 2011), http://newsfeed.time.com/2011/04/20/school-shooter-video-game-reenacts-columbine-virginia-tech-killings. After a Web site that made School Shooter available for download removed it in response to mounting criticism, the developer stated that it may make the game available on its own Web site. Inside the Sick Site of a School Shooter Mod (Mar. 26, 2011), http://ssnat.com.
The page was cited in a Supreme Court decision, but of course giving a URL is far less permanent than citing an article in a newspaper, magazine, journal, etc. Since the Supreme Court gave only the URL, if it weren't for the Wayback Machine coincidentally happening to grab the page, we would never be able to see what Justice Alito was actually referring to.
Wow. There's a reason we were taught not to trust a web site as a source in general: printed materials usually last longer. I am so thankful to learn about this case tonight. Thanks!
Had to use the good old Wayback Machine to see what was originally on that site that warranted citing in a court case. Looks like it was all about the school shootings. Here's the link: http://bit.ly/1bF1kB4
1. http://blogs.law.harvard.edu/futureoftheinternet/2013/09/22/...
2. http://www.law.cornell.edu/supct/html/08-1448.ZC.html
3. http://ssnat.com/