Hi HN, author here. If anyone knows why layer pushes need to be sequential in the OCI specification, please tell! Is it merely a historical accident, or is there some hidden rationale behind it?
Edit: to clarify, I'm talking about sequentially pushing a _single_ layer's contents. You can, of course, push multiple layers in parallel.
It makes clean-up simpler: if the "last" chunk never arrived, it's obvious after N+timeout that the upload didn't finish, and you can expunge it. It simplifies an implementation detail (how do you deal with partial uploads? make them easy to spot). Otherwise you basically have to trigger at the end of every chunk, check whether all the other chunks are there, and then do the 'completion'.
But that's an implementation detail, and I suspect it isn't a meaningful or intentional one. Your S3 approach should work fine, btw - I've done it before in a prior life at a company shipping huge images, where $0.10/GB/month _really_ added up.
You lose the 'bells and whistles' of ECR, but those are pretty limited (imho)
It's been a long time, but I think you're correct. In my environment I didn't actually care (any failed push would be retried, so the layers would always eventually complete; and anything that for whatever reason didn't retry happened rarely enough that, at S3 prices, it wasn't worth doing anything clever about).
I think OCI ordered manifests first to "open the flow", but then the close only happens when the manifest's last entry has completed - which led to this ordered upload problem.
If your uploader knows where the chunks are going to live (OCI is more or less CAS, so it's predictable), it can just put them there in any order as long as it's all readable before something tries to pull it.
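To illustrate what "predictable" means here - a rough sketch, where the bucket and key layout are made up rather than any real registry's storage schema - the blob's digest determines both the pull URL and, in a content-addressed store, the object key, before a single chunk has been uploaded:

```go
// Sketch: a blob's identity is just the SHA-256 of its bytes, so both the
// pull URL (defined by the OCI distribution spec) and a hypothetical
// content-addressed object key are known up front. An out-of-band uploader
// could therefore write the parts in any order (e.g. via S3 multipart
// upload) as long as everything is readable before someone pulls it.
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	layer := []byte("...compressed layer bytes...")
	digest := fmt.Sprintf("sha256:%x", sha256.Sum256(layer))

	// Where a client will ask for it, per the distribution spec:
	fmt.Println("GET /v2/myorg/myimage/blobs/" + digest)

	// Where an uploader could park the bytes (illustrative layout):
	fmt.Println("s3://my-registry-bucket/blobs/" + digest)
}
```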
Never dealt with pushes, but it's nice to see this. Back when Docker was getting started there was no usable private registry container, so I dumped an image behind nginx and pulled from that. I enjoyed reading your article.
Source: I have implemented an OCI-compliant registry [1], though for the most part I've been following the behavior of the reference implementation [2] rather than the spec, on account of how convoluted the spec is.
When the client finalizes a blob upload, they need to supply the digest of the full blob. This requirement evidently serves to enable the server side to validate the integrity of the supplied bytes. If the server only started checking the digest as part of the finalize HTTP request, it would have to read back all the blob contents that had already been written into storage in previous HTTP requests. For large layers, this can introduce an unreasonable delay. (Because of specific client requirements, I have verified my implementation to work with blobs as large as 150 GiB.)
Instead, my implementation runs the digest computation throughout the entire sequence of requests. As blob data is taken in chunk by chunk, it is simultaneously streamed into the digest computation and into blob storage. Between each request, the state of the digest computation is serialized in the upload URL that is passed back to the client in the Location header. This is roughly the part where it happens in my code: https://github.com/sapcc/keppel/blob/7e43d1f6e77ca72f0020645...
I believe that this is the same approach that the reference implementation uses. Because digest computation can only work sequentially, the upload has to proceed sequentially as well.
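For illustration, here's a rough sketch of the idea (not Keppel's actual code): Go's SHA-256 state implements encoding.BinaryMarshaler, so it can be serialized after each chunk and restored at the start of the next request.

```go
// Rough sketch: carry a SHA-256 computation across chunked upload requests
// by serializing the hash state between requests.
package main

import (
	"crypto/sha256"
	"encoding"
	"encoding/base64"
	"fmt"
	"io"
	"strings"
)

// appendChunk streams one chunk into both blob storage and the running
// digest, then returns the serialized digest state to embed in the
// upload URL handed back to the client.
func appendChunk(storage io.Writer, chunk io.Reader, prevState string) (string, error) {
	h := sha256.New()
	if prevState != "" {
		// Restore the digest state left behind by the previous request.
		raw, err := base64.StdEncoding.DecodeString(prevState)
		if err != nil {
			return "", err
		}
		if err := h.(encoding.BinaryUnmarshaler).UnmarshalBinary(raw); err != nil {
			return "", err
		}
	}
	// Write the chunk into storage and the hash in one pass.
	if _, err := io.Copy(io.MultiWriter(storage, h), chunk); err != nil {
		return "", err
	}
	raw, err := h.(encoding.BinaryMarshaler).MarshalBinary()
	if err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(raw), nil
}

func main() {
	var blob strings.Builder
	state, _ := appendChunk(&blob, strings.NewReader("hello "), "")
	state, _ = appendChunk(&blob, strings.NewReader("world"), state)

	// At finalize time, restore the last state and compare against the
	// digest the client supplied - no need to re-read the blob.
	h := sha256.New()
	raw, _ := base64.StdEncoding.DecodeString(state)
	h.(encoding.BinaryUnmarshaler).UnmarshalBinary(raw)
	fmt.Printf("sha256:%x\n", h.Sum(nil)) // digest of "hello world"
}
```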
> For the last four months I’ve been developing a custom container image builder, collaborating with Outerbounds
I know you said this is a topic for another blog post, but could you already share some details? Maybe a link to a GitHub repo?
Background: I'm looking for (or might implement myself) a way to programmatically build OCI images from within $PROGRAMMING_LANGUAGE. Think Buildah, but as an API for an actual programming language instead of a command line interface. I could of course just invoke Buildah as a subprocess but that seems a bit unwieldy (and I would have to worry about interacting with & cleaning up Buildah's internal state), plus Buildah currently doesn't support Mac.
Unfortunately, all the code is proprietary at the moment. If you are willing to get your hands dirty, the main thing to realize is that container layers are "just" tar files (see, for instance, this article: https://ochagavia.nl/blog/crafting-container-images-without-...). Contact details are in my profile, in case you'd like to chat ;)
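To give a flavor of it, here's a toy sketch (nothing to do with the proprietary builder): it assembles a one-file layer in memory and prints the fields a manifest would reference for it.

```go
// Toy sketch: a container layer really is just a (typically gzipped)
// tarball, addressed by the digest of its bytes.
package main

import (
	"archive/tar"
	"bytes"
	"compress/gzip"
	"crypto/sha256"
	"fmt"
)

func main() {
	var buf bytes.Buffer
	gz := gzip.NewWriter(&buf)
	tw := tar.NewWriter(gz)

	// One regular file inside the layer.
	content := []byte("hello from a hand-rolled layer\n")
	tw.WriteHeader(&tar.Header{
		Name: "hello.txt",
		Mode: 0o644,
		Size: int64(len(content)),
	})
	tw.Write(content)
	tw.Close()
	gz.Close()

	// A manifest references the layer by the digest and size of the
	// compressed bytes.
	digest := sha256.Sum256(buf.Bytes())
	fmt.Println("mediaType: application/vnd.oci.image.layer.v1.tar+gzip")
	fmt.Printf("digest:    sha256:%x\n", digest)
	fmt.Printf("size:      %d\n", buf.Len())
}
```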
Thanks for the link! Though I'm less worried about the tarball / OCI spec part, more about platform compatibility. I tried running runc/crun by hand at some point and let's just say I've done things before that were more fun. :)
I can't think of an obvious reason - maybe it's load-based?
~~I added parallel pushes to docker I think, unless I'm mixing up pulls & pushes - it was a while ago.~~ My stuff was around parallelising the checks, not the final pushes.
Edit - does a layer say which layer it goes "on top" of? If so, perhaps that's the reason: so that the IDs of whatever is being pointed to already exist.
Layers are fully independent of each other in the OCI spec (which makes them reusable). They are wired together through a separate manifest file that lists the layers of a specific image.
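For example, a minimal image manifest looks roughly like this (trimmed, with placeholder digests):

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:<digest of config blob>",
    "size": 7023
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:<digest of layer 1>",
      "size": 32654
    },
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:<digest of layer 2>",
      "size": 16724
    }
  ]
}
```

The array order determines how the filesystem gets assembled, but the layer blobs themselves don't reference each other.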
It's a mystery... Here are the bits of the OCI spec about multipart pushes (https://github.com/opencontainers/distribution-spec/blob/58d...). In short, you can only upload the next chunk after the previous one finishes, because you need to use information from the response's headers.
If you've got plenty of time for the build, you can. Make a two-stage build where the first stage installs Python and pytorch, and the second stage does ten COPYs which each grab 1/10th of the files from the first stage. Now you've got ten evenly sized layers. I've done this for very large images (lots of Python/R/ML crap) and it takes significant extra time during the build but speeds up pulls because layers can be pulled in parallel.
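Sketched as a Dockerfile (illustrative only: the base image, package, and split points are made up, and in practice you'd add one COPY per disjoint slice until everything is covered):

```dockerfile
# Stage 1: do the expensive install once, into a self-contained venv.
FROM python:3.11-slim AS builder
RUN python -m venv /opt/venv && \
    /opt/venv/bin/pip install --no-cache-dir torch

# Stage 2: copy the venv over in disjoint slices. Each COPY becomes its own
# layer, so instead of one giant layer you get several similarly sized ones
# that clients can pull in parallel.
FROM python:3.11-slim
COPY --from=builder /opt/venv/lib/python3.11/site-packages/torch \
                    /opt/venv/lib/python3.11/site-packages/torch
COPY --from=builder /opt/venv/lib/python3.11/site-packages/nvidia \
                    /opt/venv/lib/python3.11/site-packages/nvidia
# ...more COPYs covering the remaining disjoint slices of /opt/venv...
ENV PATH=/opt/venv/bin:$PATH
```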
Surely you can have one layer per directory or something like that? Splitting along those lines works as long as everything isn't in one big file.
I think it was a mistake to make layers, as a storage model, visible to the end user. This should just have been an internal implementation detail, perhaps similar to how Git handles delta compression and keeps it independent of the branching structure. We should also have delta pushes and pulls using global caches (for public content), and the ability to start containers while their image is still in transfer.
It should be possible to split into multiple layers as long as each file is wholly within its layer. This is the exact opposite of the commonly recommended practice of combining commands to keep everything in one layer, which I think is ultimately done for runtime performance reasons.
I've dug fairly deep into docker layering; it would be wonderful if there were a sort of `LAYER ...` barrier instead of layers being created implicitly by `RUN ...` lines.
Theoretically there's nothing stopping you from building the docker image and "re-layering it", as they're "just" bundles of tar files at the end of the day.
In true "when all you have is a hammer" fashion, as very best I can tell that syntax= directive is pointing to a separate docker image whose job it is to read the file and translate it into builtkit api calls, e.g. https://github.com/moby/buildkit/blob/v0.15.0/frontend/docke...
But, again for clarity: I've never tried such a stunt; that's just the impression I get from having done mortal kombat with BuildKit's other silly parts.
Thanks, that helps a lot and I didn't know about it :) It's a touch less powerful than full transactions (because AFAICT you can't, say, merge a COPY and a RUN together), but it's a big improvement.