
This is my second reply, but I think this is really important.

We only need to look at early film history to know how easy it is to lose massive parts of our history.

Going back to old pages, I frequently get 404 results. For politically sensitive documents, the problem is much more widespread.

I would like something that not only archives pages I visit, but also versions them and tracks changes. If there were a bookmarking tool that did this, you could easily have an opt-in feature that shared content. This type of system would be a huge boost to something like the wayback machine.



What do you think of this generalized architecture?

HARDWARE: differs depending on whether you want local search/analytics or just network storage.

For mobile use, either a VPN back to your personal home/cloud server, or a hackable wifi hard drive proxy, e.g. Seagate Wireless Plus + HackGFS.

For non-analytics home use, hackable router with USB3 storage and Linux software RAID, connected to a USB3 drive chassis with room for 2-4 disks.

For analytics home use, a microserver like HP N54L, Dell T20 or Lenovo TS140. Up to Xeon processor with ECC memory, plus 4-6 internal disks and up to 32GB RAM. Sold without a Windows tax, supports hardware virtualization and Linux. Possibly FreeBSD with ZFS.

SOFTWARE: generalized multi-tier cache AND compute. Camlistore and git-annex are tackling multi-device storage sync. For archives, we need a search interface that will query a series of caches, e.g. mobile > home > trusted friends private VPN (tinc overlay) > public paid cloud archive (pinboard et al) > public free cloud archive (archive.org).

It's important for usability to have a simple, local UX that will take a search string, propagate across all private/public federated tiers of storage and compute, then aggregate the metasearch results on the client.
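A minimal sketch of that tiered metasearch, assuming a simple (url, snippet) result shape and stand-in search functions for each tier; a real client would query recoll locally, peers over the tinc overlay, and remote archives over HTTP:

```python
# Hypothetical sketch of the tiered metasearch described above: query each
# storage tier in order (local -> home -> friends -> cloud) and aggregate
# results on the client. Tier names and the query API are illustrative.

def metasearch(query, tiers):
    """Query every tier in order and merge results, deduplicating by URL."""
    seen = set()
    results = []
    for tier_name, search_fn in tiers:
        for url, snippet in search_fn(query):
            if url not in seen:
                seen.add(url)
                results.append({"tier": tier_name, "url": url, "snippet": snippet})
    return results

# Stand-in tiers; real ones would hit recoll, a VPN peer, archive.org, etc.
local = lambda q: [("http://example.com/a", "local hit")]
home  = lambda q: [("http://example.com/a", "dup"),
                   ("http://example.com/b", "home hit")]

hits = metasearch("zfs raid", [("local", local), ("home", home)])
```

Because earlier (more trusted, lower-latency) tiers are queried first, their copies of a URL win over later ones.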

With this approach, we can collectively pool resources to improve on CommonCrawl.org, without locking up the 300TB index at AWS. This would turn web search engines into a secondary source, rather than a primary source. First search your archive + trusted friends, then trusted verticals (e.g. HN, StackOverflow), then a generic web search.

Let's be clear: the goal is not to archive "everything" in the world, only that which is personally important to the viewer. This attention metadata has long-term value. With this architecture, it is always optional to escalate a query to a public archive or search engine. Most importantly, there is technical autonomy and low-latency compute for local queries.

For web pages, wget of WARC formats (per HN advice on another thread) and wkhtmltopdf (available as Firefox plugin to print to PDF) will keep local archives. Recoll.org (xapian front-end with user-customizable python filters) on Linux will search full text and provide preview snippets, or lucene/solr can be adapted.
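As a sketch of what that tooling produces: wget's --warc-file option writes pages into the WARC container format, where each record is a block of CRLF-delimited headers followed by a payload of the stated length. The toy parser below (record contents invented for illustration; real archives are gzip-compressed and far larger) shows the structure a local indexer would walk:

```python
# Minimal sketch of reading one WARC record, like those wget --warc-file
# produces. The sample record is hand-built for illustration.

def parse_warc_record(data):
    """Split one WARC record into version line, header fields, and payload."""
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                                   # e.g. "WARC/1.0"
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]

sample = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.com/\r\n"
          b"Content-Length: 12\r\n"
          b"\r\n"
          b"hello, world")

version, fields, payload = parse_warc_record(sample)
```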


Holy living God.

Solves my problem, and then some. Also provides an alternative to search engines as the goto for internet browsing. Brilliant stuff. I can't tell you how grateful I am for your work on this.

The only thing left is FLOSS version control for sound and video editing, and an effective "publish to BitTorrent" feature, and then we can pretty much put this "Web 2.0" crap to bed.

EDIT: Out of curiosity, is there any reason that analytics couldn't be done with a dedicated PC, or do you think it requires server hardware to run effectively?


It can be done on a dedicated PC, ideally one that supports h/w virtualization (VT-x and VT-d). People who want to purchase a new PC would need known-good configs. The suggested devices are relatively cheap (no Windows tax, the Dell one officially supports drivers for RedHat Linux) and can be used as PCs. With virtualization support, one could run a local Windows desktop, analytics in a Linux VM, and NAS storage in a separate Linux/FreeBSD VM - all on the same computer. For defensive security, you want to separate a read-only content store from potentially-vulnerable programs which parse & analyze data in the content store.

As someone said in another comment, the challenge with these solutions is making them usable to a mass audience who won't know or care about Linux. Android & OS X both created user experiences that hid the underlying Unix OS. If someone can only afford a single computer, then virtualization allows that device to play the role of both "server" and "desktop".

With local s/w RAID, the system can be designed so that backup consists of (1) shutting down the computer, (2) removing one hard drive and replacing it with another, and (3) taking the (encrypted) drive to an offsite location, e.g. a trusted friend or family member.

> The only thing left is FLOSS version control for sound and video editing

Could you expand more on this use case? Do you mean archiving binary blobs and storing their metadata in git? git-annex does this and camlistore can be adapted for this purpose. Is this only for local production workflow, or is there a need for remote collaboration on pre-final artifacts?

> an effective "publish to BitTorrent" feature

Could you spec out what's needed? Are there trackers which specialize in public domain content? Strong DRM is coming to browsers and new hardware. While it will create unexpected user experiences and change the web as we know it today, it could increase demand for public domain content. We are going to need distribution channels for public domain content which treat copyrighted content like viruses, i.e. reject before they can enter the channel.

The same fingerprinting techniques used for copyrighted content can also be used to improve discovery of public domain content. This means central directories (which can be cached locally) of metadata and hash/fingerprints for public domain content (text, audio, video, raw data).
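A sketch of that hash-based directory idea: a locally cached mapping from content fingerprint to license metadata. The directory entry below is invented for illustration (keyed by the real SHA-256 of the bytes b"hello"):

```python
# Locally cached slice of a (hypothetical) central public-domain directory,
# keyed by SHA-256 content fingerprint. Entry contents are illustrative.

import hashlib

directory = {
    "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824":
        {"title": "hello", "license": "CC0"},
}

def lookup(content: bytes):
    """Fingerprint content and check it against the cached directory."""
    digest = hashlib.sha256(content).hexdigest()
    return directory.get(digest)

entry = lookup(b"hello")          # known public-domain content
unknown = lookup(b"mystery bytes")  # not in the directory -> None
```

Real audio/video matching would need perceptual fingerprints rather than exact hashes, but the directory lookup shape is the same.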

Another feature: support for software agents that take action based on the computed result of (remote event + private data). A FLOSS version of IFTTT or Yahoo Pipes, e.g. https://github.com/cantino/huginn . A remote event could be a private signal from a mobile device app. This would increase flexibility, since the user has a larger private dataset about themselves than any web/cloud service. The user can configure FLOSS algorithms, instead of relying on remote black boxes without appeal or transparent governance.
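A toy sketch of such an agent, combining a remote event with private data the service never sees; the event shape and rule are invented for illustration (huginn's real agents are far richer):

```python
# The private dataset stays on the user's own hardware; only the agent,
# running locally, combines it with remote events.
private_data = {"home_city": "Oslo", "interests": {"archiving", "zfs"}}

def agent(event, private, actions):
    """Fire an action when a remote event matches the user's private data."""
    if event["type"] == "new_post" and event["topic"] in private["interests"]:
        actions.append(("notify", event["url"]))

fired = []
agent({"type": "new_post", "topic": "zfs", "url": "http://example.com/p"},
      private_data, fired)
agent({"type": "new_post", "topic": "cats", "url": "http://example.com/q"},
      private_data, fired)
# Only the first event matches the user's interests.
```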


I'm starting to think that remix culture has been essentially stifled by licensing. Artists like Steinski, Danger Mouse, and Girl Talk have gained notoriety exclusively through word of mouth. Copyright and licensing prevent them from monetizing their music, so they are forced to promote and release it in other ways, either quick pressed vinyl or free internet distribution. In a remix culture, I'm hesitant to describe anything as "pre-final", or, perhaps we should label everything as "pre-final".

I think a public domain or creative commons music community could develop a thriving remix scene in unique ways. Remixes can be recontextualized as forks of music. I would imagine that fingerprinting could be used really effectively here, which will require some thought. As a recording artist, it would be really useful for me to be able to delete takes but retain them in version history. The music industry is filled with stories of the person who owns the recording studio retaining the masters to a session, and then refusing to cooperate with the artist. Opening that data up would be a massive boon not only to musicians but I think to recording studios as well, as the finer aspects of a recording session become much easier to access.

There's a deeper problem, though. There's a dichotomy between the binary blobs that DAWs use and the user-readable stem files that artists like Radiohead release. It's as though all the information about the studio session and all the non-temporal aspects of editing are obfuscated by assembly code. Ideally, it would be nice to replace the binary blobs that programs like Audacity and Ardour save to with human-readable stems that encode sound or video editing in metadata. Ultimately I think this is a critical UI requirement, but for now I think the best solution is to use git-annex to store .flacs and .oggs, as well as both individual exported stem files and binary project blobs. In the final analysis, different DAWs are really more like different instruments, and cross compatibility between instruments is a requirement.

Uploading files to a server is pretty easy, but what is a little more challenging is automatically licensing the content, uploading to your server, creating a torrent that uses the server as a webseed, and then publishing it to a tracker. This publishing flow would let me directly publish content even on my puny little shared web hosting account. Traffic load is automatically distributed through bittorrent, and censorship becomes much more difficult (albeit not impossible). From there, I think, it would be fairly trivial to build a front-end to replace an interface like YouTube's "Submit Video" page.
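A sketch of the torrent-building step in that flow: a minimal bencoder plus a single-file metainfo that lists your own web host as a web seed via the "url-list" key (BEP 19), so the shared hosting account seeds until peers take over. The tracker and host URLs are placeholders:

```python
# Build a single-file .torrent with a web seed. Minimal sketch; real tools
# (mktorrent, transmission-create) handle multi-file layouts and more.

import hashlib

def bencode(obj):
    """Minimal bencoder for ints, strings/bytes, lists and dicts."""
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        out = b"d"
        for k in sorted(obj):          # spec requires keys in sorted order
            out += bencode(k) + bencode(obj[k])
        return out + b"e"
    raise TypeError(type(obj))

def make_torrent(name, data, tracker, webseed, piece_len=16384):
    pieces = b"".join(hashlib.sha1(data[i:i + piece_len]).digest()
                      for i in range(0, len(data), piece_len))
    return bencode({
        "announce": tracker,
        "url-list": [webseed],         # web seed (BEP 19): your own host
        "info": {"name": name, "length": len(data),
                 "piece length": piece_len, "pieces": pieces},
    })

torrent = make_torrent("song.ogg", b"fake audio bytes",
                       "http://tracker.example/announce",
                       "http://myhost.example/song.ogg")
```

Publishing would then be: write the bytes to a .torrent file, upload song.ogg to the host at the webseed URL, and submit the torrent to the tracker.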

Treating copyrighted content like viruses is a feature I would really like to see more of. While I'm a firm supporter of PopcornTime/Time4Popcorn/AsPopcornGoesBy, I don't really want to use it. I'd rather find public domain things to watch than Hollywood Movies or TV shows. I wish PopcornTime had a public domain feature.

If there are trackers that only use public domain or Creative Commons content, nobody has invited me to them. ;) I think it wouldn't be too hard to make a tracker that scanned for a machine-readable public domain or CC license, but I don't have much experience with torrent trackers. I'd like to try to set one up soon; I think initially you could just moderate content. I'm kind of baffled by the direction Bittorrent, Inc. has taken, because they could just as easily be advocating for number of seeds as a surveillance-free fanbase metric, rather than just releasing DRM'd music through Bittorrent.
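As a sketch of that license scan: Creative Commons' license chooser emits machine-readable rel="license" markup, so a tracker could look for it in an uploaded work's page before accepting the upload. The page snippet below is invented for illustration, and real moderation would need much more than this:

```python
# Detect machine-readable rel="license" links, as emitted by the
# Creative Commons license chooser. Stdlib-only sketch.

from html.parser import HTMLParser

class LicenseFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("a", "link") and a.get("rel") == "license":
            self.licenses.append(a.get("href"))

def find_licenses(html):
    p = LicenseFinder()
    p.feed(html)
    return p.licenses

page = ('<a rel="license" '
        'href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a>')
found = find_licenses(page)
```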

Huginn looks really, really interesting. I sure appreciate your posts. Thanks very much.


It clearly hasn't occurred to you that private collections evaporate over time, just like public ones.

> I would like something that not only archives pages I visit, but also versions them and tracks changes.

That's a huge storage requirement; you must realize this. If you're an avid Web browser, and if every archived page had to look as it originally looked (i.e. all the linked resources), you could accumulate several terabytes per week.

> This type of system would be a huge boost to something like the wayback machine.

Here's the road to madness. Someone, aware of the rapidly declining cost of storage, rebuilds the Wayback machine based on your scheme, with the intent of archiving every Web page in existence, including all required resources so the pages look just as they originally did. Then, as the project approaches completion, this genius says, "For the next phase, I need to archive the Wayback machine itself." At that point, as the implications of what he's said occur to him, a strange look crosses his face and his imagination begins writing checks his intellect can't cash.


Several terabytes per week is a huge overestimate. Right now I'm getting 12 Mbps download speed (sadly typical for US broadband). If I saturated the connection, and if no one throttled me, I could download 900 gigabytes in a week. That would be some intense web surfing.

My practical experience is that you need about 1.5 MB per URL for storing large numbers of web pages, if you exclude video.
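A quick back-of-envelope check of both numbers (decimal units, as ISPs and drive vendors use):

```python
# Upper bound: a 12 Mbps link saturated 24/7 for one week.
mbps = 12
bytes_per_week = mbps / 8 * 1e6 * 60 * 60 * 24 * 7
gb_per_week = bytes_per_week / 1e9      # ~907 GB, i.e. under 1 TB

# Density: at ~1.5 MB per archived URL, pages stored per gigabyte.
pages_per_gb = 1e9 / 1.5e6              # ~667 pages/GB
```

So even a maximally abusive archiver on typical US broadband stays under a terabyte per week, and realistic browsing is a small fraction of that.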


For compressed static html and maybe images, you could store your entire history on a flash drive nowadays. See this: http://memkite.com/blog/2014/04/01/technical-feasibility-of-...


I never said I was trying to save my entire web cache locally. Just bookmarks.

I don't have Flash Player installed. Instead, I use youtube-dl for flash videos across many websites. I'm well aware of the storage implications of this kind of activity.


The storage implications are really not so bad. You would have trouble ever breaking 100 GB unless you are some kind of bookmarking mutant.



