Hi HN, author here. If anyone knows why layer pushes need to be sequential in the OCI specification, please tell! Is it merely a historical accident, or is there some hidden rationale behind it?
Edit: to clarify, I'm talking about sequentially pushing a _single_ layer's contents. You can, of course, push multiple layers in parallel.
It makes clean-up simpler: if the "last" chunk never arrived, it's obvious after N+timeout that the upload didn't finish, and you can expunge it. It simplifies an implementation detail (how do you deal with partial uploads? make them easy to spot). Otherwise you basically have to trigger at the end of every chunk, check whether all the other chunks are there, and then do the 'completion'.
But that's an implementation detail, and I suspect it isn't a meaningful or intentional one. Your S3 approach should work fine, btw - I've done it before in a prior life at a company shipping huge images, where $0.10/GB/month _really_ added up.
You lose the 'bells and whistles' of ECR, but those are pretty limited (imho)
It's been a long time, but I think you're correct. In my environment I didn't actually care (any failed push would be retried, so the layers would always eventually complete; and anything that for whatever reason didn't retry happened rarely enough that, at S3 prices, it wasn't worth doing anything clever about).
I think OCI ordered manifests first to "open the flow", but then the close only happens when the manifest's last entry has completed - which led to this ordered upload problem.
If your uploader knows where the chunks are going to live (OCI is more or less CAS, so it's predictable), it can just put them there in any order as long as it's all readable before something tries to pull it.
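To illustrate what "predictable" means here - a rough sketch, where the bucket and key layout are made up rather than any real registry's storage schema - the blob's digest determines both the pull URL and, in a content-addressed store, the object key, before a single chunk has been uploaded:

```go
// Sketch: a blob's identity is just the SHA-256 of its bytes, so both the
// pull URL (defined by the OCI distribution spec) and a hypothetical
// content-addressed object key are known up front. An out-of-band uploader
// could therefore write the parts in any order (e.g. via S3 multipart
// upload) as long as everything is readable before someone pulls it.
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	layer := []byte("...compressed layer bytes...")
	digest := fmt.Sprintf("sha256:%x", sha256.Sum256(layer))

	// Where a client will ask for it, per the distribution spec:
	fmt.Println("GET /v2/myorg/myimage/blobs/" + digest)

	// Where an uploader could park the bytes (illustrative layout):
	fmt.Println("s3://my-registry-bucket/blobs/" + digest)
}
```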
Never dealt with pushes, but it's nice to see this. Back when Docker was getting started there was no usable private registry container, so I dumped an image behind nginx and pulled from that. I enjoyed reading your article.
Source: I have implemented an OCI-compliant registry [1], though for the most part I've been following the behavior of the reference implementation [2] rather than the spec, on account of how convoluted the spec is.
When the client finalizes a blob upload, they need to supply the digest of the full blob. This requirement evidently serves to enable the server side to validate the integrity of the supplied bytes. If the server only started checking the digest as part of the finalize HTTP request, it would have to read back all the blob contents that had already been written into storage in previous HTTP requests. For large layers, this can introduce an unreasonable delay. (Because of specific client requirements, I have verified my implementation to work with blobs as large as 150 GiB.)
Instead, my implementation runs the digest computation throughout the entire sequence of requests. As blob data is taken in chunk by chunk, it is simultaneously streamed into the digest computation and into blob storage. Between each request, the state of the digest computation is serialized in the upload URL that is passed back to the client in the Location header. This is roughly the part where it happens in my code: https://github.com/sapcc/keppel/blob/7e43d1f6e77ca72f0020645...
I believe that this is the same approach that the reference implementation uses. Because digest computation can only work sequentially, the upload has to proceed sequentially as well.
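For illustration, here's a rough sketch of the idea (not Keppel's actual code): Go's SHA-256 state implements encoding.BinaryMarshaler, so it can be serialized after each chunk and restored at the start of the next request.

```go
// Rough sketch: carry a SHA-256 computation across chunked upload requests
// by serializing the hash state between requests.
package main

import (
	"crypto/sha256"
	"encoding"
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// appendChunk streams one chunk into both blob storage and the running
// digest, then returns the serialized digest state to embed in the
// upload URL handed back to the client.
func appendChunk(storage io.Writer, chunk io.Reader, prevState string) (string, error) {
	h := sha256.New()
	if prevState != "" {
		// Restore the digest state left behind by the previous request.
		raw, err := base64.StdEncoding.DecodeString(prevState)
		if err != nil {
			return "", err
		}
		if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(raw); err != nil {
			return "", err
		}
	}
	// Write the chunk into storage and the hash in one pass.
	if _, err := io.Copy(io.MultiWriter(storage, h), chunk); err != nil {
		return "", err
	}
	raw, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

func main() {
	var blob strings.Builder
	state, _ := appendChunk(&blob, strings.NewReader("hello "), "")
	state, _ = appendChunk(&blob, strings.NewReader("world"), state)

	// At finalize time, restore the last state and compare against the
	// digest the client supplied - no need to re-read the blob.
	h := sha256.New()
	raw, _ := base64.StdEncoding.DecodeString(state)
	h.(encoding.BinaryUnmarshaler).UnmarshalBinary(raw)
	fmt.Printf("sha256:%x\n", h.Sum(nil)) // digest of "hello world"
}
```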
> For the last four months I’ve been developing a custom container image builder, collaborating with Outerbounds
I know you said this is a topic for another blog post, but could you already share some details? Maybe a link to a GitHub repo?
Background: I'm looking for (or might implement myself) a way to programmatically build OCI images from within $PROGRAMMING_LANGUAGE. Think Buildah, but as an API for an actual programming language instead of a command line interface. I could of course just invoke Buildah as a subprocess but that seems a bit unwieldy (and I would have to worry about interacting with & cleaning up Buildah's internal state), plus Buildah currently doesn't support Mac.
Unfortunately, all the code is proprietary at the moment. If you are willing to get your hands dirty, the main thing to realize is that container layers are "just" tar files (see, for instance, this article: https://ochagavia.nl/blog/crafting-container-images-without-...). Contact details are in my profile, in case you'd like to chat ;)
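To give a flavor of it, here's a toy sketch (nothing to do with the proprietary builder): it assembles a one-file layer in memory and prints the fields a manifest would reference for it.

```go
// Toy sketch: a container layer really is just a (typically gzipped)
// tarball, addressed by the digest of its bytes.
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)

	// One regular file inside the layer.
	content := []byte("hello from a hand-rolled layer\n")
	tw.WriteHeader(&tar.Header{
		Name: "hello.txt",
		Mode: 0o644,
		Size: int64(len(content)),
	})
	tw.Write(content)
	tw.Close()
	gz.Close()

	// A manifest references the layer by the digest and size of the
	// compressed bytes.
	digest := sha256.Sum256(buf.Bytes())
	fmt.Println("mediaType: application/vnd.oci.image.layer.v1.tar+gzip")
	fmt.Printf("digest:    sha256:%x\n", digest)
	fmt.Printf("size:      %d\n", buf.Len())
}
```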
Thanks for the link! Though I'm less worried about the tarball / OCI spec part, more about platform compatibility. I tried running runc/crun by hand at some point and let's just say I've done things before that were more fun. :)
I can't think of an obvious reason - maybe it's load-based?
~~I added parallel pushes to docker I think, unless I'm mixing up pulls & pushes - it was a while ago.~~ My stuff was around parallelising the checks, not the final pushes.
Edit - does a layer say which layer it goes "on top" of? If so, perhaps that's the reason: so that the IDs of whatever is being pointed to already exist.
Layers are fully independent of each other in the OCI spec (which makes them reusable). They are wired together through a separate manifest file that lists the layers of a specific image.
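For example, a minimal image manifest looks roughly like this (trimmed, with placeholder digests):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:<digest of config blob>",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:<digest of layer 1>",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:<digest of layer 2>",
      "size": 16724
    }
  ]
}
```

The array order determines how the filesystem gets assembled, but the layer blobs themselves don't reference each other.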
It's a mystery... Here are the bits of the OCI spec about multipart pushes (https://github.com/opencontainers/distribution-spec/blob/58d...). In short, you can only upload the next chunk after the previous one finishes, because you need to use information from the response's headers.
If you've got plenty of time for the build, you can. Make a two-stage build where the first stage installs Python and pytorch, and the second stage does ten COPYs which each grab 1/10th of the files from the first stage. Now you've got ten evenly sized layers. I've done this for very large images (lots of Python/R/ML crap) and it takes significant extra time during the build but speeds up pulls because layers can be pulled in parallel.
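Sketched as a Dockerfile (illustrative only: the base image, package, and split points are made up, and in practice you'd add one COPY per disjoint slice until everything is covered):

```dockerfile
# Stage 1: do the expensive install once, into a self-contained venv.
FROM python:3.11-slim AS builder
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir torch

# Stage 2: copy the venv over in disjoint slices. Each COPY becomes its own
# layer, so instead of one giant layer you get several similarly sized ones
# that clients can pull in parallel.
FROM python:3.11-slim
COPY --from=builder /opt/venv/lib/python3.11/site-packages/torch \
                    /opt/venv/lib/python3.11/site-packages/torch
COPY --from=builder /opt/venv/lib/python3.11/site-packages/nvidia \
                    /opt/venv/lib/python3.11/site-packages/nvidia
# ...more COPYs covering the remaining disjoint slices of /opt/venv...
ENV PATH=/opt/venv/bin:$PATH
```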
Surely you can have one layer per directory or something like that? Splitting along those lines works as long as everything isn't in one big file.
I think it was a mistake to make layers, as a storage model, visible to the end user. This should just have been an internal implementation detail, perhaps similar to how Git handles delta compression and keeps it independent of the branching structure. We should also have delta pushes and pulls using global caches (for public content), and the ability to start containers while their image is still in transfer.
It should be possible to split into multiple layers as long as each file is wholly within its layer. This is the exact opposite of the commonly recommended practice of combining commands to keep everything in one layer, which I think is ultimately done for runtime performance reasons.
I've dug fairly deep into docker layering; it would be wonderful if there were a sort of `LAYER ...` barrier instead of layers being created implicitly by `RUN ...` lines.
Theoretically there's nothing stopping you from building the docker image and "re-layering it", as they're "just" bundles of tar files at the end of the day.
In true "when all you have is a hammer" fashion, as very best I can tell that syntax= directive is pointing to a separate docker image whose job it is to read the file and translate it into builtkit api calls, e.g. https://github.com/moby/buildkit/blob/v0.15.0/frontend/docke...
But, again for clarity: I've never tried such a stunt; that's just the impression I get from having done mortal kombat with BuildKit's other silly parts.
Thanks, that helps a lot and I didn't know about it :) It's a touch less powerful than full transactions (because AFAICT you can't, say, merge a COPY and a RUN together), but it's a big improvement.