Even if you could break SHA1, it's unlikely that your replacement source code wo...

flingo · on Aug 10, 2020

Why would it require that much data? I always thought you wouldn't need to add or change more bytes than are in the output.

Also, git hashes aren't just based on source code. You can add that data anywhere that git uses to generate the hash.

db48x · on Aug 10, 2020

That's true of a CRC code, but hashes are a lot harder to break.

Git hashes each file, and puts those hashes into a tree object, like a directory listing. Then it hashes the trees, recursively back up to the root of the repository. Finally the hash of the root tree is put in the commit object, and the commit object is hashed. Thus the two places you can put additional data to be hashed are the file contents (either in existing files or new files), or in the commit message. You can get a few free bits by adjusting less obvious things like the commit timestamp or the author's email address, but not nearly enough to make your forged commit have the same hash as an existing commit.

flingo · on Aug 10, 2020

I'm still not following why it'd require so much data? I thought the goal was to have the commit hash collide with an existing commit hash, is that not enough?

I looked around, and it seems like the right place to hide the added data is in the "trailer" section of the commit. It's where signed-off-by lives and is used to generate the commit hash.

You might want to come up with a plausible reason for random data to go in there though. (likely using a header that wouldn't normally get printed out)

db48x · on Aug 11, 2020

In a CRC-style code, you're essentially adding up all the bytes and letting it overflow the counter, so that the counter is a fixed size (usually 16 or 32 bits). Then you add a few more bytes, exactly the same size as the counter, so that the data bytes plus the extra bytes add up to zero. The extra bytes are delivered along with the data bytes, so that the recipient can repeat the calculation and verify that the total is still zero. If you modify the data, it is trivial to recalculate the CRC code so that the total is still zero.

Hashes are much, much more complex, and they're non-linear. Each bit of the hash output is intended to depend on every single bit of the input, so that changing a single bit in the input creates a radically different hash output.

In a paper published this year, https://eprint.iacr.org/2020/014.pdf, the authors Gaëtan Leurent and Thomas Peyrin changed the values of 825 bytes out of a 1152 byte PGP key in order to generate a new key with the same signature (aka, the same hash). It only cost about $45k, too.

kibwen · on Aug 10, 2020

The git hash surely also takes the contents of binary files into account, so I imagine that in any repo that contains non-text files, an attacker would try to hide the garbage inside e.g. some metadata field of an image file.

db48x · on Aug 10, 2020

That's true. PDFs and other document formats are also great because you can include large volumes of data that is never used in the final output.