OK, this is quite a serious vulnerability in Subversion. SVN depends more on raw file SHA1 hashes than git does: git prepends a header before hashing, which prevents raw SHA1 collisions from translating directly into easy svn-style repository corruption.
The reason svn is broken is its "rep-sharing" feature, i.e. file content deduplication. It uses a SQLite database to share the representation of files based on their raw SHA1 checksum - for details see http://svn.apache.org/repos/asf/subversion/trunk/subversion/...
SVN exposes the SHA1 checksum as part of its external API, but its deduplication could easily have been built on a more secure foundation. Their decision to double down on SHA1 in 2013 was foolish.
> this is quite a serious vulnerability in Subversion
I rather believe it's a minor bug, and that once it is fixed, they can keep using SHA1 as before, without the denial of service when somebody tries this. For example, if somebody tries to commit two files with the same SHA1 but different MD5, they can reject the second one before accepting it. Or, if there are two different files with the same SHA1, they accept both, and they store only one copy of the content, SVN can still continue to work. You then can't get the second file back unless you, for example, put it in some archive format first and then commit that. OK, that's your problem; SVN would still work for everything else.
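To make the "same SHA1 but different MD5, reject the second one" idea concrete, here is a minimal sketch. It is not how svn does it; `CollisionGuard` and `admit` are made-up names, and the one-character primary hash in the demo is a deliberately weak stand-in so that collisions actually occur.

```python
import hashlib
from typing import Callable, Dict

Digest = Callable[[bytes], str]


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


class CollisionGuard:
    """Hypothetical pre-commit check: remember a secondary digest per primary digest."""

    def __init__(self, primary: Digest = sha1_hex, secondary: Digest = md5_hex):
        self.primary = primary
        self.secondary = secondary
        self.seen: Dict[str, str] = {}  # primary digest -> secondary digest

    def admit(self, data: bytes) -> bool:
        """Return False when the primary digest is known but the secondary differs."""
        p, s = self.primary(data), self.secondary(data)
        known = self.seen.setdefault(p, s)
        return known == s


# Demo with a weak primary hash (first hex digit of SHA-1, an assumption
# made only so that collisions show up with a handful of inputs).
guard = CollisionGuard(primary=lambda d: sha1_hex(d)[:1])
for i in range(17):
    payload = str(i).encode()
    print(payload, "accepted" if guard.admit(payload) else "rejected")
```

With 17 distinct inputs and only 16 possible primary values, at least one input has to be rejected. Re-submitting content that was already accepted is still fine, since both digests match.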
In short, it sounds like a denial of service at the moment, but I think that DOS can be avoided without changing the hash algorithm.
However, I'm sure that SVN is not the only source base that was never up to now tested with two different files that have the same SHA1.
I didn't realise yesterday that svn also uses SHA1 for deduplicating pristine files in its working copy. So disabling rep-sharing isn't enough to prevent broken checkouts: you need to prevent any SHA1 collisions from being committed. See the link from Stefan Sperling (stsp) to the collision rejection script elsewhere in this thread. There is more info from Stefan about what they need to do to fix this at http://mail-archives.apache.org/mod_mbox/subversion-dev/2017...
As mentioned in a previous comment ( https://news.ycombinator.com/item?id=13722469 ) git doesn't see these the same as it hashes the header+content which breaks the identical SHA trick.
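For anyone who wants to see the header+content difference, a small sketch: git hashes a blob as the ASCII header "blob <length>", a NUL byte, then the content, so the raw SHA-1 of a file and its git object ID differ. The function names here are mine.

```python
import hashlib


def raw_sha1(content: bytes) -> str:
    """SHA-1 of the file bytes alone, i.e. a raw file checksum."""
    return hashlib.sha1(content).hexdigest()


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 as git computes it for a blob: header, NUL byte, then content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


data = b"hello world\n"
print("raw :", raw_sha1(data))
print("git :", git_blob_sha1(data))  # same value `git hash-object` reports for these bytes
```

Two files that collide under `raw_sha1` will in general not collide under `git_blob_sha1`, because the colliding blocks were computed for a different prefix.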
Of course, I first tested this on our main production repository at work because...oh, wait, I didn't because what were you thinking?!
I don't think they meant to test it on the production repository. Rather, they added a test for something in WebKit, and it didn't occur to them that it would be "testing" the repository too.
It could be made to work on Git, but you'd need to make a collision that included the git blob header. The resulting files would not have the same SHA-1 hash until the header was added though, so they wouldn't be useful except for testing Git itself.
My guess is that Git wouldn't be 'hosed' like SVN, since it currently doesn't have a secondary hash to detect the corruption. It would simply restore the wrong file without noticing anything was amiss.
> It would simply restore the wrong file without noticing anything was amiss.
Why hasn't Git switched to SHA-2? People have been warning that SHA-1 is vulnerable for over a decade, but that vulnerability was dismissed with a lot of hand-waving over at Git. Is it a very difficult technical problem to switch, or just a problem of backward compatibility for existing repos (i.e., it would be expensive to change everything over)?
> Is it a very difficult technical problem to switch, or just a problem of backward compatibility for existing repos (i.e., it would be expensive to change everything over)?
A bit of both. Git has an ongoing effort to replace everywhere in the source code that passes around SHA1s as fixed-size arrays of bytes with a data structure. That'll make it possible to replace the hash. But even with that work, git will still need to support reading and writing repositories in the existing format, and numerous interoperability measures.
Mercurial hasn't switched either for similar reasons, although the format did reserve space for a bigger hash (32 bytes over sha-1's 20 bytes) since 2006, less than a year into hg's existence.
Linus suggested that changing the hash algorithm doesn't need these changes: just take the first 160 bits of a SHA2 or whatever and use that. The chances of collisions would still be less than with SHA1.
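A rough sketch of the truncation idea, assuming SHA-256 as the "SHA2 or whatever" and a helper name of my own choosing. It keeps the 20-byte width of a SHA-1 digest so existing fixed-size fields would still fit; it says nothing about how git itself would handle the switch.

```python
import hashlib


def truncated_sha256(data: bytes, bits: int = 160) -> str:
    """Hex digest made of the leading `bits` bits of SHA-256."""
    if bits % 8 != 0 or not 0 < bits <= 256:
        raise ValueError("bits must be a multiple of 8 between 8 and 256")
    return hashlib.sha256(data).digest()[: bits // 8].hex()


digest = truncated_sha256(b"abc")
print(digest, len(digest))  # 40 hex characters, the same width as a SHA-1 digest
```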
I think the more important collision to worry about in 2^80 time is the Earth colliding with the Sun.
Our interstellar successors, if any, will probably have found something better than Git to use.
EDIT: I should be clear that I'm not making the usual silly claim that we don't need to worry about hashes being broken because they take forever to brute force. I'm saying that hashes will be broken, but not by brute forcing the entire hash space. A decade or so of cryptographic research will save you eons of compute time.
2^80 time is not as much as you think it is. The bitcoin network is currently calculating about 3 * 2^60 hashes per second. It can do 2^80 hashes in under a week.
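The arithmetic, spelled out with the figures from the comment above (the 3 * 2^60 hashes per second rate is taken as given, not something I measured):

```python
hashes_per_second = 3 * 2**60   # rate quoted above
target_hashes = 2**80

seconds = target_hashes / hashes_per_second
days = seconds / 86_400         # seconds in a day

print(f"{seconds:,.0f} seconds, about {days:.2f} days")
```

That comes out to roughly four days, so "under a week" holds.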
The 2^80 space cost of doing a birthday attack is a lot more notable, but it's not infeasible either. Yearly hard drive production is somewhere around 2^70 bytes. You're not pulling that attack off in the year 2017, but with a big budget you could probably get there in a few decades.
Checking 2^80 hashes is indeed faster than I thought, thanks for the stats.
But checking 2^80 hashes and writing them to long term storage is still ridiculous. That budget should still go to hiring cryptographers, not buying hard drives.
His style is a super abrasive unnecessary power trip but this key point is relevant:
> It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want.
The key phrase being "looks halfway sane". Git doesn't just look at the hash. It looks at the object structure too (headers) and that makes it highly resistant to weaknesses in the crypto alone. His point essentially is you should design to expect crypto/hash vulnerabilities, and that's a smart stance, as they are discovered every few years.
Linus was not talking about the object headers, but about the object contents. It's harder to make the colliding objects look like sane C code, without some strange noise in the middle (which wouldn't be accepted by the project maintainers).
Yes, it's a "C project"-centric view, but consider the date: it was the early days of git. The main way of receiving changes was emailed patches, not pull requests. Binary junk would have a hard time getting in. And even if it did get in, the earliest copy of the object wins, as long as the maintainers added "--ignore-existing" to the rsync command in their pull scripts (yeah, this thread seems to be from before the git fetch protocol), as mentioned earlier in the thread.
Honestly, this isn't nearly as abrasive as some things Linus has said and it has some cogent generalized engineering advice mixed in. Certainly not the worst thing he's said. Also, he was correct at the time and left open the possibility of something changing in the future.
It hasn't switched because Linus (1) doesn't think anyone would do that and (2) sees hash collisions only as an accident vector, not an intentional attack vector.
> You are _literally_ arguing for the equivalent of "what if a meteorite hit my plane while it was in flight - maybe I should add three inches of high-tension armored steel around the plane, so that my passengers would be protected".
> That's not engineering. That's five-year-olds discussing building their imaginary forts ("I want gun-turrets and a mechanical horse one mile high, and my command center is 5 miles under-ground and totally encased in 5 meters of lead").
> If we want to have any kind of confidence that the hash is really unbreakable, we should make it not just longer than 160 bits, we should make sure that it's two or more hashes, and that they are based on totally different principles.
> And we should all digitally sign every single object too, and we should use 4096-bit PGP keys and unguessable passphrases that are at least 20 words in length. And we should then build a bunker 5 miles underground, encased in lead, so that somebody cannot flip a few bits with a ray-gun, and make us believe that the sha1's match when they don't. Oh, and we need to all wear aluminum propeller beanies to make sure that they don't use that ray-gun to make us do the modification _ourselves_.
> So please stop with the theoretical sha1 attacks. It is simply NOT TRUE that you can generate an object that looks halfway sane and still gets you the sha1 you want. Even the "breakage" doesn't actually do that. And if it ever _does_ become true, it will quite possibly be thanks to some technology that breaks other hashes too.
> I worry about accidental hashes, and in 160 bits of good hashing, that just isn't an issue.
I don't mean this to say that you are being inaccurate, just that his current position seems a little different now:
"Again, I'm not arguing that people shouldn't work on extending git to a new (and bigger) hash. I think that's a no-brainer, and we do want to have a path to eventually move towards SHA3-256 or whatever"
http://marc.info/?l=git&m=148787457024610&w=2
Assuming you meant "hide behind", his original attitude seems to be more like "this is sufficiently unlikely in practice that I consider attempting to mitigate it in advance to be overengineering with a higher opportunity cost than it's worth". Which, well, I think when it comes to security stuff he has a nasty tendency to underestimate the risks and thereby pick the wrong side of the trade-off, but to me it's clearly a trade-off rather than something to hide behind.
I think it's a reasonable assumption that, as computing power increases, hash functions will be broken. Not that they have to be, but it's reasonable to assume that, and I think it's beyond short-sighted for Torvalds to have failed to build a mechanism for hash function migration into git from the very start.
Cryptanalytic research is the fundamental thing that broke SHA-1, not simply the increase in available computing power. So that's not really a 'reasonable assumption'. If it were, we could 'reasonably' assume SHA-512 will never be broken.
The point remains, since computing power increases, and cryptanalytic research advances, we really should make sure software that depends on cryptographic hashes has a reasonable way to move to different algorithms. At the very least we could add as a prefix to the resulting hash the name of the algorithm that generated it when we store it.
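A sketch of what "prefix the stored hash with the algorithm name" could look like. The `name:hexdigest` format, the function names, and the allow-list are all assumptions for illustration, not an existing git or svn format.

```python
import hashlib
import hmac

# Hypothetical allow-list; a real system would choose its own.
ALLOWED = {"sha1", "sha256", "sha3_256"}


def tagged_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Store the algorithm name alongside the digest, e.g. 'sha256:ba78...'."""
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify(data: bytes, tagged: str) -> bool:
    """Recompute with whatever algorithm the stored value names."""
    algorithm, _, expected = tagged.partition(":")
    if algorithm not in ALLOWED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    actual = hashlib.new(algorithm, data).hexdigest()
    return hmac.compare_digest(actual, expected)


old = tagged_digest(b"abc", "sha1")   # value written before a migration
new = tagged_digest(b"abc")           # value written after it
print(old, verify(b"abc", old))
print(new, verify(b"abc", new))
```

Old and new records can then coexist, and readers pick the right function per record instead of assuming one hash everywhere.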
The advances of research and computing power are vastly outpaced by basic things like digest size. Even if you came up with a complexity reduction for SHA-256 of the same order as the one developed against SHA-1, you still wouldn't be able to find any SHA-256 collisions.
(from the link) "For the record: the commits have been deleted, but the SVN is still hosed." That is pretty much my memory of working with SVN. I remember SVN fouling its database a few times. Sure I've broken git a few times, but I am always able to (as Jenny Bryan says) "burn the whole thing down" and take state from another copy of the repository.
I really tried with SVN (wanted something better than CVS) for quite a long time.
I've done surgery on svn repos to unhose things a few times over the years, usually due to PEBCAK rather than svn shitting itself. It's actually pretty doable, up to and including the equivalent of interactive rebase.
I much prefer that git's designed to let me do such things and provides tools for doing so, but you can totally rewire svn repos with vi and a bunch of swearing if necessary.
(and I was using svk for a merge tool at the time so I did have the option to burn it down and rebuild from scratch; unhosing svn repos wasn't quite unpleasant enough for me to want to do so)
Then again, I started off doing more ops than dev and have also happily hand-edited mysql replication logs to unfuck things after a partial failover, so I may have more of a masochistic streak than you do :)
To provide a counter anecdote, my company has used SVN for 10 years across hundreds of thousands of commits for a repo that is now 1.2TB in size and not once have we needed to restore from backups.
The bugs you can expect from software that assumed no hash collisions are going to be pretty arbitrary. There was that Stack Overflow post about what happens with Git with collisions and it didn't seem great either; it's just that what gets hashed happens not to collide in this case.
Reminds me of when I worked at an antivirus company. We had to be careful with the EICAR file in test code because it would set off AV alarms. http://www.eicar.org/86-0-Intended-use.html
A bit hard for me to tell what happened here, maybe because I don't know anything about SVN. The two PDFs with equal SHA1 hashes were git committed to the repository, but converting that to an SVN commit failed because... SVN can't handle two separate files with the same SHA1 hash?
It's likely some part of the svn implementation that assumes that the SHA1 signatures guarantee uniqueness within a repo. And they might use that hash as an identifier.
I'm guessing shattered-1.pdf and shattered-2.pdf have identical hashes but distinct contents. It's not clear to me why this results in a "checksum mismatch."
A 256-bit hash, even a bad one, is usually more secure than a 128-bit hash, even a good one. But a 256-bit hash designed to be used as a 256-bit hash is probably going to be better than trying to come up with your own one ad-hoc by combining two 128-bit hashes. E.g. many hash functions have parts in common, so you might not get 256 independent bits that way.
In this scenario, the reasoning usually is: since both sha1 and md5 are vulnerable, it is possible to construct a document in which both the sha1 and md5 match. I don't know how feasible this is, nor how much compute time it would add. But that is a typical argument against that type of "combine two hashes" approach.
No. You can't assume that two documents match because their hash values match. That's what caused this problem. You don't solve a problem by doing the same thing that caused the problem in the first place. For any given string s of unbounded length and its hash h, there are an infinite number of strings s', s'', s''', etc that have the same hash value h. Change from 128 to 256 bit hash? Great, there are still an infinite number of collisions. Change to two concatenated hashes? Guess what: infinite number of collisions.
A hash is not a unique identifier. Period. It's only useful as a quick filter before you do a full comparison.
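A minimal sketch of "quick filter, then full comparison" for a deduplicating store. `DedupStore` is a made-up class, not svn's rep-cache; the constant digest in the demo is only there to force a collision.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Ref = Tuple[str, int]  # (digest, index within that digest's bucket)


class DedupStore:
    """Hypothetical store: the digest narrows the search, bytes decide equality."""

    def __init__(self, digest: Callable[[bytes], str] = lambda b: hashlib.sha1(b).hexdigest()):
        self._digest = digest
        self._buckets: Dict[str, List[bytes]] = defaultdict(list)

    def put(self, content: bytes) -> Ref:
        key = self._digest(content)
        bucket = self._buckets[key]
        for index, existing in enumerate(bucket):
            if existing == content:      # full comparison, not just the hash
                return key, index        # true duplicate: share the stored copy
        bucket.append(content)           # same digest, different bytes: keep both
        return key, len(bucket) - 1

    def get(self, ref: Ref) -> bytes:
        key, index = ref
        return self._buckets[key][index]


# Demo: a constant "hash" makes every input collide.
store = DedupStore(digest=lambda b: "same")
first = store.put(b"shattered-1")
second = store.put(b"shattered-2")
again = store.put(b"shattered-1")
print(first, second, again)
print(store.get(second))
```

The cost is one byte comparison per candidate in a bucket, which in normal operation is a single candidate.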
It's amazing how resistant people are to using hashes safely. They willfully ignore the birthday paradox and say LA LA LA 1/(2^128)=0 and because they haven't lost all their data yet they tell themselves that their shoddy practices are OK.
They were directly committed to the SVN repository, apparently breaking SVN's tooling even after the commit had been deleted. The git-to-SVN mirror script was the first place where a failure was noticed and was initially thought to be the only broken bit.
I have to just say here that WebKit is one of the most over-the-top software projects I've ever tried to dig into, in my twenty years of programming. Building it inside a vanilla container was impossible following their directions exactly and required so much research on my part to get working. I'm used to a bit of back-and-forth with just about every project, but WebKit was ridiculous. After two workdays of trying, I'd been able to build a WebKit from the source, but at that point had to concede to the universe the futility of trying to build a golang-based Phantom, as my friend and former coworker originally wanted. And that also gave me mad respect for Phantom's author and immediately taught me why they do not often incorporate new WebKit versions into the project instead of just pegging to the first one they can get to build.
You can mitigate this vulnerability by setting enable-rep-sharing = false in fsfs.conf - see documentation in that file or in the source at http://svn.apache.org/viewvc/subversion/trunk/subversion/lib...
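If you want to check a repository's setting from a script, something like the following might do. It assumes the option lives in a `[rep-sharing]` section, as in the stock fsfs.conf template, and that an absent option means rep-sharing is on; verify both against the comments in your own fsfs.conf. The helper name and the sample text are mine.

```python
import configparser

# Sample text standing in for a repository's db/fsfs.conf (assumed layout).
SAMPLE_FSFS_CONF = """\
### comment lines like this one are ignored by the parser
[rep-sharing]
enable-rep-sharing = false
"""


def rep_sharing_enabled(conf_text: str) -> bool:
    """Return the effective enable-rep-sharing value, assuming it defaults to true."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.getboolean("rep-sharing", "enable-rep-sharing", fallback=True)


print(rep_sharing_enabled(SAMPLE_FSFS_CONF))
```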
This feature was introduced in svn 1.6 released 2009, and made more aggressive in svn 1.8 released 2013 https://subversion.apache.org/docs/release-notes/