Git archive generation meets Hyrum's law (lwn.net)
136 points by JamesCoyne on Feb 2, 2023 | 76 comments



If it can't be made stable, `git archive` should specifically add random content (under a feature flag to be removed after a year or two) so as to make the generated checksum completely unreliable and force users to adopt different workflows.


Golang does this with hashmaps: it deliberately randomizes the keys’ iteration order so that you can never depend on it. People hated it for a few months, but now it’s just another known idiom.
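
A minimal sketch of what that looks like in practice (nothing library-specific here; the two loops below will usually print the keys in different orders, and the order also changes between runs):

    package main

    import "fmt"

    func main() {
        m := map[string]int{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}

        // The Go runtime deliberately randomizes where map iteration starts,
        // so the printed key order can differ between these two loops and
        // between separate runs of the program.
        for k := range m {
            fmt.Print(k, " ")
        }
        fmt.Println()

        for k := range m {
            fmt.Print(k, " ")
        }
        fmt.Println()
    }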


Hyrum’s Law applies, even when you take into account Hyrum’s Law. :)

A story about a Go program that assumed map iteration order was uniformly random (it isn’t), which caused a load balancer to assign work unevenly:

https://dev.to/wallyqs/gos-map-iteration-order-is-not-that-r...

A graph of the map iteration order’s distribution showing that it’s not uniformly random:

https://twitter.com/cafxx/status/1135190309514620928


From the language specification: "The iteration order over maps is not specified and is not guaranteed to be the same from one iteration to the next."

There's a long way between "not specified" and "uniformly random".


The joke is that the sort of people who rely on apparent behaviour in practice isn’t the sort that tends to look things up in specifications.


Ok, you win, have a favorite.


Perl did that too IIRC, although the motivation was to avoid predictable hash collisions that might put the program into a slow path.


Didn't Python introduce that, too, after a high profile DoS CVE around 10 years ago?


Yes.

https://docs.python.org/3/reference/datamodel.html#object.__...

> Changed in version 3.3: Hash randomization is enabled by default.

But since Python 3.6, dicts keep insertion order, so the order of iteration is deterministic again.


That’s mainly for security reasons, namely preventing hash-collision DoS attacks, and therefore a requirement.


Randomly ordering the files, without adding random content, would do the trick.


Interesting idea, to make "undefined behavior" explicit.


Everybody who had to maintain an API knows this.

1. You can't just rely on documentation ("we never said we would guarantee this or that") to push back on your users' claims that you introduced a breaking change. If you care more about your documentation than your users, they will turn their back on you.

2. However if you start guaranteeing too much stability, innovation and change will become too costly or even impossible. In this instance, if the git team has to guarantee their hashes (which seems impossible anyway because it depends on the external gzip program) then they can never improve on compression.

Tough situation to be in.


Someone once stated that every observable behaviour will be depended upon by someone sooner or later.

I can only imagine someone going to great lengths to avoid such a "a stable order of operations was never guaranteed" discussion by just randomizing the order of execution or something similar (I bet someone will then use that as a seed for a PRNG).

edit: skipping the first paragraph led to repeating Hyrum's law.


I once got a bug report from a user's manager about the values in our application's private database.

It was an internal user interface, intended for employees of our company. Once upon a time, we had a process for adding a new record where it had to be added manually to multiple internal systems, so the internal UI had its own copy of the data. But then we built a single source of truth for this data. That single source of truth had an API which our application would query, so the database table updates were abandoned, as they were for the database's internal use only. However, nobody ever bothered to remove the old table with its few hundred rows of stale data.

Two years later, we got the bug report. The user's manager was complaining that the dataset was incomplete, that it was impeding his work, and that it needed to be fixed ASAP.

It turned out that at some stage he had requested, and been granted, read-only access to that DB, and had been querying the records of user actions in it to track the volume and quality of work his subordinates did. And then at some other point he realised that he could join against this table to get readable labels rather than opaque identifiers for the types of data said reports were working on. Except, of course, the data was two years stale, so he was noticing an increasing number of "missing" labels in his report.

Said user escalated all the way to a VP of engineering before accepting that no, a private database is not a supported interface of our product.


"Users will eventually use your database directly no matter how good your UI/API is" deserves law on its own tbh. Or maybe "the shittier your API/UI is the higher chance that users will just use database directly.


When I had to do user management on a multi-site Wordpress instance years ago, I had to resort to using the database to manage user groups. It was a wild time.


I hope you at least estimated how much work it would be to add a user-facing audit tracking and reporting feature. You could probably charge good money for that.


Considering this was an in-house tool for a very company-specific task, with maybe 3 managers who could possibly use that information, it was just never going to be a high priority.


> Someone once stated that every observable behaviour will be depended upon by someone sooner or later.

......... Hyrum’s law?


Yeah, I only noticed when it was too late. I was drawn to the first quote, rather than the block of text next to the author.


(Also in title and headline.)


> I can only imagine someone going to great lengths to avoid such "a stable order of operations was never guaranteed" discussion by just randomizing the order of execution or something similar (I bet someone will then use that as a seed for prng).

Bingo. https://news.ycombinator.com/item?id=34631275#34636529


The issue is that they didn't look closely at what their users / API consumers were actually doing. Even a cursory look at CI, packaging systems, etc. would have shown that those were expecting the hashes to be stable. If they'd done that early enough, they might have been able to plan a transition to unstable hashes, or at least been able to emphasise the problem in the documentation.


However, you can force through the change if you are Google, GitHub, or Cloudflare. The users will still complain, but where will they go?


Sourceforge?


I don't think this is really an example of Hyrum's law. Hyrum's law claims that even if you carefully document your contract, someone will rely on the observable behaviour rather than the documentation anyway.

But this is an example of a much weaker proposition: if you don't document your contract, then people will guess what the contract is and some of them will guess wrong.

(In fact in this case it seems it's more like "if you don't document your contract and your support staff sometimes say the behaviour is A, people will rely on the behaviour being A".)


I wonder if transfer-encoding the archive might be a better strategy. The client benefits from a stable format (tar) provided it’s generated in a stable order, which is generally easier for the server to guarantee. The network transfer occurs transparently compressed (the Transfer-Encoding header in HTTP parlance).

Checksums still work and protect against malicious tarballs, which are generally riskier to unpack than plain stream compression / decompression. The server and client get smaller file transfers, and compression improvements can evolve transparently by negotiating the transfer encoding. The server can still cache the encoded form to avoid compressing the same file repeatedly.

Seems like a win-win solution that doesn’t require a drastic redesign of package managers everywhere, and everyone walks away keeping the properties of the system they value.
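
Roughly what the client end could look like, as an untested sketch (the URL and expected digest below are placeholders, and it assumes the server serves a plain tar and negotiates gzip at the transfer layer):

    package main

    import (
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "io"
        "log"
        "net/http"
    )

    func main() {
        // Placeholder endpoint and digest, not real values: the idea is that
        // the server serves a stable, uncompressed tar and the wire
        // compression is negotiated separately.
        const url = "https://example.com/project/archive/v1.0.0.tar"
        const expected = "0000000000000000000000000000000000000000000000000000000000000000"

        // Go's default HTTP transport asks for gzip and transparently
        // decompresses the response body, so the bytes hashed here are the
        // stable tar stream rather than whatever the wire encoding was.
        resp, err := http.Get(url)
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        h := sha256.New()
        if _, err := io.Copy(h, resp.Body); err != nil {
            log.Fatal(err)
        }
        if got := hex.EncodeToString(h.Sum(nil)); got != expected {
            log.Fatalf("checksum mismatch: got %s", got)
        }
        fmt.Println("tar checksum OK")
    }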


Could just calculate the checksum of the decompressed archive at that point. You'd still want to store it on the server/client as a compressed file, and there is no point making the transfer side more complex.

You'd probably want to store the expected file size together with the checksum to avoid the "compressed stream of endless zeroes" attack vector.
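
Something along these lines, as an untested sketch (the path, expected hash, and size limit are made-up values you'd record alongside the release):

    package main

    import (
        "compress/gzip"
        "crypto/sha256"
        "encoding/hex"
        "errors"
        "fmt"
        "io"
        "log"
        "os"
    )

    // verifyDecompressed hashes the decompressed contents of a .tar.gz while
    // refusing to read more than maxSize bytes, so a tiny file that expands
    // into an endless stream of zeroes gets cut off early.
    func verifyDecompressed(path, expectedHex string, maxSize int64) error {
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()

        zr, err := gzip.NewReader(f)
        if err != nil {
            return err
        }
        defer zr.Close()

        h := sha256.New()
        // Read at most maxSize+1 bytes; seeing the extra byte means the archive
        // expanded past the size recorded alongside the checksum.
        n, err := io.Copy(h, io.LimitReader(zr, maxSize+1))
        if err != nil {
            return err
        }
        if n > maxSize {
            return errors.New("decompressed size exceeds recorded limit")
        }
        if hex.EncodeToString(h.Sum(nil)) != expectedHex {
            return errors.New("decompressed checksum mismatch")
        }
        return nil
    }

    func main() {
        // Made-up values: the recorded hash and size limit would come from
        // wherever the release metadata is published.
        err := verifyDecompressed("release.tar.gz",
            "0000000000000000000000000000000000000000000000000000000000000000",
            1<<30) // refuse anything expanding beyond 1 GiB
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("archive verified")
    }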


To be fair, the transfer side is already HTTP, which already has transfer-encoding support (client and server), so it should be transparent as long as standards-compliant clients are in use (the default ones always are, in terms of transparent decoding), and CDNs are perfectly capable of caching the compressed response AFAIK.

The main simplification is that there’s less work on the client side at scale - the file you download is the file you checksum. That’s different if package maintainers have to do it manually for each package.

Can you say more about the endless-zeroes attack? Are you thinking about finding a SHA-256 collision? You have to keep computing the SHA-256 for every new byte, which is expensive. And if that ever becomes practical, the ecosystem will switch to SHA-384, SHA-512, or SHA-512/256. But sure, storing file size + hash is generally a good idea to make it that much harder (in practice no one bothers, and this advice would apply regardless of compression, because the expensive bit is the digest computation to find a collision).


> Hyrum's law

Didn't Google beat Hyrum's law by using their weight to force middleboxes to accept some variation in some datum of an http header or something?

Edit: hint: something about rotating a value for some number of decades. Either forcing the hand of middleboxes or CAs, I can't remember. In either case, it seemed like a real pain in the ass to keep the API observability concrete from hardening. :)


You might be thinking of GREASE for TLS: https://chromestatus.com/feature/6475903378915328


Yep, thanks! So not middleboxes or CAs, but old rusty servers.

The other example of evading Hyrum's Law that comes to mind was when early Javascript users of JSON observed how they could intersperse comments.

Crockford said he noticed people were using the comments to stuff preprocessing directives into JSON.

He then devised the most ingenious hack: He told people they weren't allowed to put comments in JSON. Then people stopped putting comments in JSON.

I'm starting to wonder whether Hyrum's Law is really more of a suggestion. :)


That's the advantage of being Google, GitHub, or Cloudflare.


From the post: "it may well become necessary for anybody who wants consistent results to decompress archive files before checking checksums."

I'm certain there's some exploit waiting to subvert the decompression algorithm and substitute malicious content in place of the actual archive files.


Depending on the format, yes. Usually it is by exhausting some resource, such as a file that decompresses to an impossibly large dummy file. If the dummy file crashes an analyzer of the compressed archive, then other malicious files could be hidden.

https://en.wikipedia.org/wiki/Zip_bomb


Even before the untar phase, couldn't there be a vulnerability in:

- your HTTPS stack

- gzip encoded HTTP

- your sha256 program


You'd probably want to limit the decompressed size so you don't get 10 TB of zeroes compressed down to a tiny file. But that's not that hard.


Can't Github just keep the old archive as it is for the already-existing releases and use the new format for new releases? Over time old releases phase out and the advantage of the new format is completely in effect. You can even use a time-based cut-off date if you somehow want to get it in sync.


The article explicitly says, "Internally, this archive is created at request time by the `git archive` subcommand". In other words, there is no pre-existing archive and apparently no cache of generated archives. Which means a request for an archive gets one generated with whatever format is in effect at that moment.

Why github doesn't cache archives instead of regenerating them on the fly is unclear, and maybe something the developers should address. Or maybe there was a cache and it got blown away by the change that caused the archive checksums to change.


Yes, they can. The thing is that tarballs are not part of the release artifacts, even though they do appear to be among them. If you look closely, even the endpoints that user-uploaded artifacts and generated tarballs point to are different.

Github could just generate the tarball once and store it in the same way as other release artifacts. But for some reason they chose not to.


Some reason? Storage is not free after all and many tarballs won't be fetched frequently I'd assume.


To clarify: I am proposing to use the old archival method for already-existing releases, for example by passing the necessary arguments to git archive forever. Releases without a previous archive, or created after some point in the future, call git archive with the new arguments. No substantial extra storage should be necessary.

Once the accompanying gzip-version is sufficiently old or unsafe, breaking very old releases does not matter anymore, making it a seamless transition.


> Can't Github just…You can even…

In my experience this sort of simplistic proposal by an outsider is almost always born of ignorance about the complexities of the actual system.


Isn't that why it's a question?


I'd imagine they create it on demand and just cache it for some time. That's way less storage than having every single commit also stored as a tarball somewhere.


Mentioned in the article


HANG ON!

I think this just made me realise the cause of an issue I was having with Swift Package Manager a few months back. We have a bunch of ObjC frameworks in our app that we don't want people to update anymore so we can rewrite them, and we just threw them all into a big umbrella project. But for some reason we couldn't get the binary target URL from GitHub Enterprise to work on our self-hosted Enterprise instance, because the checksum would be different every time, even though it worked perfectly for GitHub Cloud.

Is there anyone from GitHub here? Can you confirm that this is the cause of the issue for GH Enterprise?


Might be related to https://github.blog/changelog/2023-01-30-git-archive-checksu... somehow, though that was only live for a couple of days last month I thought.


Had to follow the links to figure out what Hyrum’s Law was (I like laws). The best link from that law is the obligatory xkcd at the very bottom. Reshared here:

https://xkcd.com/1172/


I found a good video about this too: https://lwn.net/SubscriberLink/921787/949cf79f2599f734/ (original post) --> https://www.hyrumslaw.com/ --> https://twitter.com/hyrumwright --> https://twitter.com/dret/status/1573897062785032192 --> https://www.youtube.com/watch?v=5Wdgjw6IGDM (Hyrum's Law: Hyrum Wright on Programming over Time - interview by Erik Wilde)


Couldn’t they include two checksums: one for the compressed archive, and, if that fails, decompress and check the uncompressed content?


Typically when working with automated deployment/build systems you don't want to decompress unknown data for security reasons. Checking the checksum of the compressed content solves that issue.


Mob engineering: you don’t have to read the documentation if a million other people also do not.


I think that's uncharitable. Almost no one realized these things were being generated. We all assumed that links to github's "releases" were just links to files because they look like links to files! Here's one to Zephyr 3.2.0: https://github.com/zephyrproject-rtos/zephyr/archive/refs/ta...

You pull that and get a tarball that is presented to the world as an "official release". Looks like a file. Acts like a file. It's a file.

So now your package manager or reproducible build engine or whatever needs a reference to the "official source code release", and what do you point it to? That file, obviously. It's right there on the "release" page for the download. And of course you checksum it for security, because duh.

Then last week all of a sudden that file changed! Sure, it has the same contents. But the checksum that you computed in good faith based on the official release tarball doesn't match!

If there's a misunderstanding here, it's on github and not the users. They can't be providing official release tarballs if they won't guarantee consistency. "As documented", this feature was a huge footgun[1]. That's bad.

[1] Actually it's worse: as documented it's basically useless. If you can't externally validate the results of that archive file, then the only way to use it is to tell your users that they have to trust Microsoft not to do anything bad, because you can't make any promise about the file that they can verify!


The contents of the archive could be verified rather than the archive itself, since that's what needs to not change. If the compression level changed, the archives would also have a different checksum, but no one would say they require archives to always use a specified compression level.

The fact it looked like an immutable file is much more relevant though.


This is entirely right. The feature as it exists is insane if it doesn't guarantee consistent hashes. However, there are alternatives even in the face of an adversarial GitHub. Every software project could agree on a manifest format and some kind of PKI/WoT to distribute certificates.

Pigs would fly first, but it's possible!


> We all assumed that

Uh huh.


In this case, it seems that GitHub was asked about it. From the thread linked in the article:

> After a fruitful exchange with GitHub support staff, I was able to confirm the following (quoting with their permission):

>> I checked with our team and they confirmed that we can expect the checksums for repository release archives, found at /archive/refs/tags/$tag, to be stable going forward. That cannot be said, however, for repository code download archives found at archive/v6.0.4.

>> It's totally understandable that users have come to expect a stable and consistent checksum value for these archives, which would be the case most of the time. However, it is not meant to be reliable or a way to distribute software releases and nothing in the software stack is made to try to produce consistent archives. This is no different from creating a tarball locally and trying verify it with the hash of the tarball someone created on their own machine.

>> If you had only a tag with no associated release, you should still expect to have a consistent checksum for the archives at /archive/refs/tags/$tag.

> In summary: It is safe to reference archives of any kind via the /refs/tags endpoint, everything else enjoys no guarantees.

(posted 4 Feb 2022)

https://github.com/bazel-contrib/SIG-rules-authors/issues/11...

There's even a million linked PRs and issues where people went around and specifically updated their code to point to the URLs that were, nominally, stable.

I suspect that the GH employee who made these comments just misunderstood how these archives were being generated, or the behavior depended on some internal implementation detail that got wiped away at some point. But if an employee at a big-ass company publicly says "yeah that's supported" to employees at another big-ass company, people are gonna take it as somewhat official.


2018 Gentoo-dev called, wants to let you know this is old news: https://www.mail-archive.com/gentoo-dev@lists.gentoo.org/msg...


Indeed. The proper thing (also read as: the friendliest way for distro packagers) is for software projects to generate and publish a tarball themselves as part of their tag+release process.

That provides multiple advantages. Unlike GitHub’s unreliable automatically generated files, a fixed file can be hashed or cryptographically signed by the project (with SSH signatures, Signify, PGP, etc.), and later verified without having to extract the files first or check out the underlying repo.

Another thing many projects aren’t aware of: if your project uses Git submodules, anyone using GitHub’s autogenerated tarballs will be unable to build your software, because those don’t contain submodules.


> Unlike GitHub’s unreliable automatically generated files, a fixed file can be hashed or cryptographically signed by the project (with SSH signatures, Signify, PGP, etc.), and later verified without having to extract the files first

Or how about this: Microsoft could provide that as a feature in their "official release" page for projects, which is exactly what we all thought that page was for in the first place.

Seriously: if archive links are unreliable they're basically useless anyway. Who wants tarballs in the modern world except for package management or build automation?


People who want the source code but don't have `git`. It's also why downloading as a ZIP is an option.


People who want the source code, don't have git, and won't get git in order to get the source code sounds like the empty set. Though zip makes me think of Windows, so maybe over there. The world of Perforce.


Why?

I don't need git for my local personal use of the software, but I do need the source code if I need to modify the software.

Some people even downloaded source code even before git existed.


You need a compiler and a bunch of other build tools. Why not git?


What about people who don't have Git and want to build it from source?


> Indeed. The proper thing (also read as: the friendliest way for distro packagers) is for software projects to generate and publish a tarball themselves as part of their tag+release process.

And this is easy enough to do automatically with GitHub Actions; I have a workflow [1] which runs on each release to create a stable archive of the source and attach it to the release.

[1] https://github.com/elesiuta/picosnitch/blob/master/.github/w...


The proper thing is for software build processes that rely on tarballs from GitHub to switch to using git directly, either by shallow clone or by storing a full repo and checking out worktrees as appropriate. Tarballs at a tagged revision are fine as release artifacts if your upstream is publishing them as release artifacts, but the whole point of this is that they aren't.


> And no, a git tag is not a release.

Why not?


> more easily support compression across operating systems

I cannot help but wonder if this change was forced upon GitHub by Microsoft because gzip is GPLv3; maybe this other version is a clean-room clone. We all know corporations hate GPLv3, including the large corporation I work for.

https://www.gnu.org/software/gzip/


First of all, the change was made upstream in git, which is not controlled by GitHub (even though GitHub does have some developers who work on git). And the stated reason (not relying on third-party tools/libraries) is consistent with many other changes made to git over its history, e.g. the conversion of many git commands from Perl to C.

Furthermore, gzip isn't even necessarily the best tool to produce gzip data. If you want multi-core parallelism there's pigz, and if you're willing to trade higher CPU usage to get a better compression ratio you can use zopfli. I don't know the details of the implementation in git and whether it tries to leverage multi-threading or zopfli-like techniques, but the point stands that gzip isn't the final word on producing gzip data.


It was git that implemented the change; then GitHub upgraded to the affected version. AFAIK, MS has no influence over upstream git.

As much as I distrust Microsoft, I don't think there were any ulterior motives here.


IIRC the change was made by a git-for-windows developer. And some of the top contributors to git are from GitHub/Microsoft.


If this were true, this would have been a problem a long time ago. Why would Microsoft wait such a long time to change this when under your assumption it would have been a continuous legal liability?


Why would that matter for an internal tool? Especially since there are plenty of much more visible GPLv3 bits in the WSL images that ship via the Windows app store.



