I just want to say how much I love this blog. I'm a self-taught programmer, which means that I tend to learn algorithms as I feel I need them (a big problem given that I have to know what I don't know before I can know it!). I've recently gotten more into computer science for its own sake, but books and papers about algorithms are often dry and unmotivating.
The pieces on this blog are always entertaining and clearly justify the excitement and purpose of the algorithm under discussion.
Computer science and the STEM world in general need more stuff like this, and I'm happy that we are in fact seeing more and more of it. Well done.
> books and papers about algorithms are often dry and unmotivating.
If you don't like standard textbook treatments like CLRS, you might like these: Introduction to Algorithms: A Creative Approach by Udi Manber (out of print but worth getting), Algorithm Design by Kleinberg and Tardos, How to Think About Algorithms by Jeff Edmonds, and Jeff Erickson's lecture notes on algorithms: http://www.cs.uiuc.edu/~jeffe/teaching/algorithms/
Jon Bentley's Programming Pearls is about what I'd call algorithm engineering. The algorithms are never anything fancy but it's a good example of algorithms applied to real-world problems. It's possibly the best single book on programming I know. Make sure to do all the exercises.
Thanks! I'll take a look. It's not so much that I don't like them, it's that I seem to be able to "get into" a paper or textbook after I understand what's so cool about what they're talking about.
In that case Programming Pearls will probably be a breath of fresh air. It doesn't start from the algorithm but from the problem to be solved and tends to first dip into the most naive solution before rationalizing why specific approaches are important.
From that perspective Skiena's book is also good, as it offers war stories where you can see first hand some of the implementations of the algorithms mentioned.
I'm also self-taught but find not all comp. sci. books are dry, particularly those with a Bell Labs background. Something about their writing style perhaps.
You may want to peruse Aho and Ullman's excellent C edition of Foundations of Computer Science. (That's Aho as in awk.) It covers what you need for a good grounding, all in one place. You can still pick up a paper edition if you want, but they've also made it available as PDFs.
If I'm understanding the application correctly, this would also work with Tillich-Zemor-style hashing based on matrix groups. The idea is to map the 0 and 1 message bits onto a pair of generators of a subgroup of the invertible 2x2 matrices with entries in some GF(2^n). The message hash is just the product of the generators corresponding to the message bits.
What's really cool is that matrix multiplication is associative, i.e. x(yz) = (xy)z, so you can perform the hashing in parallel: Hash the left half, hash the right half, and combine the results by multiplication. It also allows distributed multi-party hashing, as in this article's application.
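A minimal sketch of that, assuming toy parameters (GF(2^8) with the AES polynomial, and the commonly cited TZ-style generators; nothing here is secure or exactly the published scheme):

    # Toy TZ-style hashing over GF(2^8); field, modulus, and generators are
    # illustrative stand-ins, not the real parameters.
    IRRED = 0x11B  # x^8 + x^4 + x^3 + x + 1, chosen only for convenience

    def gf_mul(a, b):
        # Carry-less multiplication in GF(2^8), reducing modulo IRRED.
        r = 0
        while b:
            if b & 1:
                r ^= a
            a <<= 1
            if a & 0x100:
                a ^= IRRED
            b >>= 1
        return r

    def mat_mul(X, Y):
        # 2x2 matrix product with entries in GF(2^8); addition is XOR.
        (a, b), (c, d) = X
        (e, f), (g, h) = Y
        return ((gf_mul(a, e) ^ gf_mul(b, g), gf_mul(a, f) ^ gf_mul(b, h)),
                (gf_mul(c, e) ^ gf_mul(d, g), gf_mul(c, f) ^ gf_mul(d, h)))

    G0 = ((0x02, 0x01), (0x01, 0x00))  # generator consumed for a 0 bit
    G1 = ((0x02, 0x03), (0x01, 0x01))  # generator consumed for a 1 bit
    IDENT = ((1, 0), (0, 1))

    def tz_hash(bits):
        # The hash is the ordered product of generators along the message.
        acc = IDENT
        for bit in bits:
            acc = mat_mul(acc, G1 if bit else G0)
        return acc

    # Associativity lets you hash the halves independently and multiply:
    msg = [1, 0, 1, 1, 0, 0, 1, 0]
    assert tz_hash(msg) == mat_mul(tz_hash(msg[:4]), tz_hash(msg[4:]))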
Unfortunately, classical TZ hashing has fallen completely to practical collision attacks as well as first and second preimage attacks. However, those attacks all rely on very special properties of power-of-two Galois fields, so there's no reason generalized TZ-style matrix group hashing wouldn't be feasible.
Cool stuff. My head hurts less than expected from the description; I can't speak to the accuracy of the explanation, but the clarity was just about where I needed it.
It might help that I've recently been re-reading "The Mathematics of Ciphers" by S.C. Coutinho[1] which has the best easy-on-ramp introduction to the number theory behind RSA that I've encountered.
This seems cool, but I don't understand how the given algorithm addresses the use case he gives.
The stated goal is to be able to verify pieces of a file that has been chunked up. But the first step in the algorithm is to get a list of hashes for each chunk from a central server - if you have that, aren't you done?
The paper includes details that show the motivation a bit better. Broadcast erasure codes will multiply the number of blocks by a large amount, say 16. With traditional hashing, you have to generate all of these extra 'check blocks' to hash them as part of the publishing process. Using this technique, you only have to hash the actual file and can then generate all of the other hashes using the homomorphism.
edit: actually, the receiver should be able to use the homomorphism to generate hashes for the check blocks as well. so the list of authoritative hashes to download should only be for the original chunks. the goal of the technique is to be able to verify the check blocks before being able to reconstruct the original blocks.
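For concreteness, here's a minimal sketch of the homomorphic property being used (exponential hashing in a prime-order subgroup; p, q, the generators, and the coefficients below are toy-sized illustrations, not the paper's parameters):

    import random

    p = 1019   # small prime; real schemes use ~1024-bit p
    q = 509    # prime with q | p - 1 (here p = 2q + 1)
    m = 4      # number of field elements per block

    # Random generators of the order-q subgroup of Z_p^*
    random.seed(0)
    g = [pow(random.randrange(2, p - 1), (p - 1) // q, p) for _ in range(m)]

    def H(block):
        # Hash a block (vector of m elements of Z_q) as prod g_i^{b_i} mod p.
        h = 1
        for gi, bi in zip(g, block):
            h = h * pow(gi, bi, p) % p
        return h

    # Two original blocks and a check block that is their weighted sum
    b1 = [3, 14, 15, 92]
    b2 = [65, 35, 89, 79]
    a1, a2 = 5, 7                                  # coding coefficients
    check = [(a1 * x + a2 * y) % q for x, y in zip(b1, b2)]

    # The receiver verifies the check block from the original hashes alone:
    expected = pow(H(b1), a1, p) * pow(H(b2), a2, p) % p
    assert H(check) == expected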
Actually, it's even worse than that - with a rateless code like Fountain codes, you can generate an effectively unlimited number of encoded blocks, making pregeneration of hashes for them all completely impractical.
Read up on fountain codes. A good starting point is the previous post on the same blog [1], actually linked in the first sentence of the article.
The basic idea is that, instead of sending N blocks in the content file f, you have a block generator gen_block(f) which can create an infinite number of blocks as random combinations of input blocks (plus a little "magic"). A peer who downloads any 1.01N or so blocks can reconstruct the original file (with high probability); because of the "magic," it doesn't matter which 1.01N blocks they are!
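Concretely, a toy sketch of such a generator (the uniform degree choice below is a naive stand-in for the robust soliton distribution real LT codes use, and the "magic" that makes decoding work is exactly that distribution):

    import os, random

    def gen_block(source_blocks, rng):
        # Return (indices, XOR of the chosen source blocks).
        degree = rng.randint(1, len(source_blocks))
        indices = rng.sample(range(len(source_blocks)), degree)
        block = bytes(len(source_blocks[0]))
        for i in indices:
            block = bytes(a ^ b for a, b in zip(block, source_blocks[i]))
        return indices, block

    # Split a file into N fixed-size chunks and stream as many encoded blocks
    # as you like; with the right degree distribution, any ~1.01N of them let
    # a peeling decoder recover the original chunks.
    rng = random.Random(42)
    source = [os.urandom(16) for _ in range(8)]   # stand-in for file chunks
    encoded = [gen_block(source, rng) for _ in range(12)]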
Fountain codes work beautifully in broadcast protocols, where the receiver has no way of communicating back to the sender (this is not generally true for TCP/IP outside of exotic scenarios, for example multicast).
For Bittorrent-like protocols (peer-to-peer large file distribution over TCP/IP), the practical problem addressed by fountain codes is making sure that the entire file will still usually be recoverable from data that exists within the network if all the seeders (peers with a complete copy) leave; in traditional Bittorrent, there are often a few missing blocks in this situation.
The practical problem addressed by homomorphic hashing is how a receiver can rapidly identify peers that are attacking a fountain-code-based P2P network by deliberately sending corrupt downloads (and as a side-effect can of course detect accidental corruption as well.)
I had written my comment after reading the performance section of the paper, where they point out that homomorphic hashing uses much less disk I/O than hashing the entire output of a fixed-rate code. I missed the point that the receiver takes advantage of the homomorphism as well, and thus arrived at the actual capabilities somewhat circuitously.
One thing probably worth keeping in mind is that you can keep the CPU cost much saner by doing something like [1]. This is also how most implementations of RSA handle large exponents. It can be very much more efficient than carrying 2k-bit numbers through the intermediate multiplication steps.
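If [1] is the usual square-and-multiply-with-reduction trick (an assumption on my part), the idea is just to reduce mod n after every multiplication so intermediates never grow past roughly twice the modulus size:

    def mod_pow(base, exp, n):
        # Compute base**exp % n without ever materializing base**exp.
        result = 1
        base %= n
        while exp > 0:
            if exp & 1:                  # low bit set: multiply into the result
                result = result * base % n
            base = base * base % n       # square for the next bit
            exp >>= 1
        return result

    assert mod_pow(7, 560, 561) == pow(7, 560, 561)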
Could this be used to secure a majority vote by asking people to submit their chunk of one of two files? The only way to get either file would be to have a majority of the votes.
Not-patenting a (practically, not legally) patentable algorithm doesn't come easily to academic researchers these days. The authors deserve kudos (unlike the rapacious Michael Luby, fountain code inventor) for trying to make sure people can use their algorithm sometime in the next 20 years.