This is really cool, but why reinvent the wheel? SQLite, for instance, already has many years of optimization behind how it stores and accesses data on disk.
To make SQLite decentralized (like Hyperdrive) you can put the database in a torrent. Index it using full-text search (https://sqlite.org/fts5.html), for instance, then let the users seed it.
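A minimal sketch of what that indexing step could look like (table and column names are made up for illustration, and this assumes an SQLite build with FTS5 enabled):

    import sqlite3

    # Open (or create) the database that will later be put in a torrent.
    conn = sqlite3.connect("wiki.db")

    # A plain FTS5 table, one row per article (hypothetical schema).
    conn.execute("CREATE VIRTUAL TABLE IF NOT EXISTS articles USING fts5(title, body)")
    conn.execute(
        "INSERT INTO articles(title, body) VALUES (?, ?)",
        ("Alan Turing", "Alan Turing was a mathematician and computer scientist."),
    )
    conn.commit()

    # A full-text query; with a sqltorrent-style VFS only the torrent pieces
    # holding the relevant index/table pages would have to be fetched.
    for (title,) in conn.execute("SELECT title FROM articles WHERE articles MATCH ?", ("turing",)):
        print(title)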
Users can then use the sqltorrent Virtual File System (https://github.com/bittorrent/sqltorrent) to query the db without downloading the entire torrent - essentially it downloads only the torrent pieces needed to satisfy the query. I believe this is similar to the technique behind Hyperdrive, just built on standard, highly optimized tools that already exist: https://www.sqlite.org/vfs.html
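The core trick such a torrent-backed VFS relies on is simple to sketch: every byte-range read SQLite issues gets translated into the set of torrent pieces covering that range, and only those pieces need to be fetched. A rough illustration of the piece math (not sqltorrent's actual code, just the idea):

    def pieces_for_read(offset: int, length: int, piece_length: int) -> range:
        """Torrent piece indices needed to serve a read of `length` bytes at `offset`."""
        first = offset // piece_length
        last = (offset + length - 1) // piece_length
        return range(first, last + 1)

    # Example: SQLite asks for a 4096-byte page at byte offset 1,234,567 of a
    # torrent with 256 KiB pieces -> only piece 4 has to be present locally.
    print(list(pieces_for_read(1_234_567, 4096, 256 * 1024)))  # [4]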
Every time a new version of the SQLite db is published (say by Wikipedia), the peers can switch to the new torrent and reuse the pieces they already have, since SQLite's page-based storage touches as little of the file as possible (and hence as few pieces as possible) when the data is updated.
That's very interesting, thanks for the links. I work on Scuttlebutt (similar to Dat/Hypercore) and have been reimplementing our stack with 'boring' tooling like SQLite and HTTP, and I've been really enjoying it so far.
I'm going to read your blog post now, thanks a lot for the new info.
Christian has also put some work into the underlying database and such lately, but the user-facing part of that is Oasis [1], which aims to be an ssb interface with a no-JS UI, with all the logic handled by the (locally running) nodeJS server.
It seems like the blog post answers your question pretty thoroughly. The Hyperdrive index and the protocol are tuned for this use case, which is what lets it scale to hosting a Wikipedia clone. BitTorrent FS + SQLite are not tuned for this use case.
Compressed with 7-Zip, sure, but uncompressed, the entire thing takes up 10TB. The Hyperdrive post doesn't mention compression at all, so the comparison should be without it.
> As of June 2015, the dump of all pages with complete edit history in XML format at enwiki dump progress on 20150602 is about 100 GB compressed using 7-Zip, and 10 TB uncompressed.
What do you mean? The author (say, the Wikipedia owners) can change the db as they normally would (using UPDATE queries, say). Those write queries touch a minimal number of disk pages. In torrent terms this means a minimal set of pieces gets modified and needs to be re-downloaded by users.
No, the pieces you already downloaded can be reused for the new torrent. Unchanged pieces end up with the same hash, so they carry over to the new torrent's digest: http://bittorrent.org/beps/bep_0038.html
This is also why SQLite is a good choice: it's highly optimized to change as few of its pages - and hence torrent "pieces" - as possible when an update occurs.
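A rough way to check that claim yourself (hypothetical file and table names, and it assumes the db is not in WAL mode so writes land in the main file): copy the database, run an UPDATE on the copy, and count how many fixed-size "pieces" actually differ:

    import hashlib
    import shutil
    import sqlite3

    PIECE = 256 * 1024  # a typical BitTorrent piece size

    def piece_hashes(path):
        """SHA-1 of each fixed-size chunk of a file, like a torrent's piece list."""
        hashes = []
        with open(path, "rb") as f:
            while chunk := f.read(PIECE):
                hashes.append(hashlib.sha1(chunk).hexdigest())
        return hashes

    # Make a "new version" of the db and apply a small update to it.
    shutil.copyfile("wiki.db", "wiki_v2.db")
    conn = sqlite3.connect("wiki_v2.db")
    conn.execute("UPDATE articles SET body = body || ' (updated)' WHERE rowid = 1")
    conn.commit()
    conn.close()

    old, new = piece_hashes("wiki.db"), piece_hashes("wiki_v2.db")
    changed = sum(1 for a, b in zip(old, new) if a != b) + abs(len(old) - len(new))
    print(f"{changed} of {max(len(old), len(new))} pieces changed")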
If you implement this behavior yourself - handling all kinds of different queries, building a query engine on top of that, optimizing for efficiency and reliability - you're effectively rewriting a database. Sure, you can do it, but why not take advantage of battle-tested off-the-shelf tools for things like "databases" (SQLite) and "distributing data" (torrents)?
All those BEPs are in "Draft" status. Okay, libtorrent implements two of them. But also, BEP 39 (Updating Torrents Via Feed URL) doesn't really fit very well into the fully distributed setting because of the centralized URL part.
So now, to update the torrent file, you need a mechanism for a mutable document that you can update in a distributed but signed way. Or you could make an append-only feed of sequential torrent URLs... oh wait.
My point is: Hyperdrive's scope is sufficiently different from your proposed solution that yes, you could probably rely on existing tools (and I have much love for bittorrent based solutions!) but it starts feeling like shoehorning the problem into a solution that doesn't quite fit.
That draft status is of little practical relevance, though, if nothing has changed for years and no one has voiced well-founded criticism of the technical details.
I do agree though that Hyperdrive is different from what the bittorrent ecosystem has to offer. I too like not reinventing the wheel where that's not necessary, as you recommend there. I'll leave you the list of BEPs for further reading, in case you're interested: https://www.bittorrent.org/beps/bep_0000.html
I've been keeping an eye on that list for a long time. There's some really cool stuff in there, and I think bittorrent has really been within reach of being "simply good enough for most applications" for quite some time now. And the massive user base is of course a good thing there, especially if you're talking more about archival projects.
Would a sqltorrent setup make sense for sharing scraped/pulled data among users? Each user could run the data extraction themselves, or check whether anyone on the swarm already has ingested chunks to their liking. Everything is append-only and content-addressable at its base.
I've been looking around at IPFS, dat, hyperdrive, etc., and it seems like dat is the most natural setting for this, but sqltorrent is new to me.
Wouldn't seeding be a problem? You would need to seed from something that supports WebTorrent, which uses WebRTC.
With dat-sdk, users just need to go to a webpage. You really just need WebRTC without torrents.
I got rid of multiwriter by just giving each user their own dat archive and having users share their dat addresses with each other. Each user writes only to their own archive; when that happens, events are emitted and the users listening write to theirs.
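This isn't dat-sdk's actual API, but the pattern being described - one single-writer, append-only feed per user, with everyone reacting to appends on the feeds they follow - can be modeled in a few lines (names made up for illustration):

    class Feed:
        """Toy model of a single-writer, append-only feed identified by an address."""
        def __init__(self, address):
            self.address = address   # stands in for a dat/hyper key
            self.entries = []        # the append-only log
            self.listeners = []      # callbacks fired on every append ("events emit")

        def append(self, entry):
            self.entries.append(entry)
            for callback in self.listeners:
                callback(self.address, entry)

    # Each user owns exactly one feed and writes only to it.
    feeds = {address: Feed(address) for address in ("alice", "bob")}

    def follow(own_feed, other_feed):
        # When the other user appends, react by appending to our own feed.
        other_feed.listeners.append(
            lambda address, entry: own_feed.append(f"saw {entry!r} from {address}")
        )

    follow(feeds["bob"], feeds["alice"])
    feeds["alice"].append("hello")
    print(feeds["bob"].entries)  # ["saw 'hello' from alice"]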
If enough users stay online, listening to each other's addresses, I only need a web client.
Also, with offline support like Workbox Background Sync, I don't even need the internet - information transfers device to device with just an offline PWA. At least that's my goal.
I talk a bit about it here: https://medium.com/@lmatteis/torrentnet-bd4f6dab15e4
Again, I'm not against redoing things better, but why not use existing, proven tech for certain parts of the tool?