Using S3 as a Container Registry

stabbles · 2024-07-12T08:57:16 1720774636

The OCI Distribution Spec is not great, it does not read like a specification that was carefully designed.

> According to the specification, a layer push must happen sequentially: even if you upload the layer in chunks, each chunk needs to finish uploading before you can move on to the next one.

As far as I've tested with DockerHub and GHCR, chunked upload is broken anyways, and clients upload each blob/layer as a whole. The spec also promotes `Content-Range` value formats that do not match the RFC7233 format.

(That said, there's parallelism on the level of blobs, just not per blob)

Another gripe of mine is that they missed the opportunity to standardize pagination of listing tags, because they accidentally deleted some text from the standard [1]. Now different registries roll their own.

[1] https://github.com/opencontainers/distribution-spec/issues/4...

eadmund · 2024-07-12T10:53:32 1720781612

> The OCI Distribution Spec is not great, it does not read like a specification that was carefully designed.

That’s par for everything around Docker and containers. As a user experience Docker is amazing, but as technology it is hot garbage. That’s not as much of a dig on it as it might sound: it really was revolutionary; it really did make using Linux namespaces radically easier than they had ever been; it really did change the world for the better. But it has always prioritised experience over technology. That’s not even really a bad thing! Just as there are tons of boring companies solving expensive problems with Perl or with CSVs being FTPed around, there is a lot of value in delivering boring or even bad tech in a good package.

It’s just sometimes it gets sad thinking how much better things could be.

steve1977 · 2024-07-12T15:48:17 1720799297

> it really did change the world for the better.

I don’t know about that (hyperbole aside). I’ve been in IT for more than 25 years now. I can’t see that Docker container actually delivered any tangible benefits in terms of end-product reliability or velocity of development to be honest. This might not necessarily be Dockers fault though, maybe it’s just that all the potential benefits get eaten up by things like web development frameworks and Kubernetes.

But at the end of the day, todays Docker-based web app development delivers less than fat-client desktop app development delivered 20 years ago, as sad as that is.

9dev · 2024-07-12T16:51:23 1720803083

If you haven’t seen the benefits, you’re not in the business of deploying a variety of applications to servers.

The fact that I don’t have to install dependencies on a server, or set up third-party applications like PHP, Apache, Redis, and the myriad of other packages anymore, or manage config files in /etc, or handle upgrades of libc gracefully, or worry about rolling restarts and maintenance downtime… all of this was solvable before, but has become radically easier with containers.

Packaging an application and its dependencies into a single, distributable artifact that can be passed around and used on all kinds of machines was a glorious success.

PaulHoule · 2024-07-12T18:18:22 1720808302

Circa 2005 I was working at places where I was responsible for 80 and 300 web sites respectively using a large range of technologies. On my own account I had about 30 domain names.

I had scripts that would automatically generate the Apache configuration to deploy a new site in less than 30 seconds.

At that time I found that most web sites have just a few things to configure: often a database connection, the path to where files are, and maybe a cryptographic secret. If you are systematic about where you put your files and how you do your configuration running servers with a lot of sites is about as easy as falling off a log, not to mention running development, test, staging, prod and any other sites you need.

I have a Python system now with gunicorn servers and celery workers that exists in three instances on my PC, because I am disciplined and everything is documented I could bring it up on another machine manually pretty quickly, probably more quickly than I could download 3GB worth of docker images over my ADSL connection. With a script it would be no contest.

There also was a time I was building AMIs and even selling them on the AMZN marketplace and the formula was write a Java program that writes a shell script that an EC2 instance runs on boot, when it is done it sends a message through SQS to tell the Java program to shut down and image the new machine.

If Docker is anything it is a system that turns 1 MB worth of I/O into 1 GB of I/O. I found Docker was slowing me down when I was using a gigabit connection, I found it basically impossible to do anything with it (like boot up an image) on a 2MB/sec ADSL connection, with my current pair of 20MB/s connections it is still horrifyingly slow.

I like how the OP is concerned about I/O speed and bringing it up and I think it could be improved if there was a better cache system (e.g. Docker might even work on slow ADSL if it properly recovered from failed downloads)

However I think Docker has a conflict between “dev” (where I’d say your build is slow if you ever perceive yourself to be waiting) and “ops” (where a 20 minute build is “internet time”)

I think ops is often happy with Docker, some devs really seem to like it, but for some of us it is a way to make a 20 sec task a 20 minute task.

cogman10 · 2024-07-12T19:58:51 1720814331

And I'm guessing with this system you had a standard version of python, apache, and everything else. I imagine that with this system if you wanted to update to the latest version of python, in involved a long process making sure those 80 or 300 websites didn't break because of some random undocumented breaking change.

As for docker image size, really just depends on dev discipline for better or for worse. The nginx image, for example, adds about 1MB of data on top of the whatever you did with your website.

belthesar · 2024-07-12T21:24:00 1720819440

You hit a few important notes that are worth keeping in mind, but I think you handwave some valuable impacts.

By virtue of shipping around an entire system's worth of libraries as a deployment artifact, you are indeed drastically increasing the payload size. It's easy to question whether payload efficiency is worthwhile when the advent of >100, and even >1000 Mbit internet connections available to the home, but that is certainly not the case everywhere. That said, assuming smart squashing of image deltas and basing off of a sane upstream image, much of that pain is felt only once.

You bring up that you built a system that helped you quickly and efficiently configure systems, and that discipline and good systems design can bring many of the same benefits that containerized workloads do. No argument! What the Docker ecosystem provided however was a standard implemented in practice that became ubiquitous. It became less important to need to build one's own system, because the container image vendor could define that, using a collection of environment variables or config files being placed in a standardized location.

You built up a great environment, and one that works well for you. The containerization convention replicates much of what you developed, with the benefit that it grabbed a majority mindshare, so now many more folks are building with things like standardization of config, storage, data, and environment in mind. It's certainly not the only way to do things, and much as you described, it's not great in your case. But if something solves a significant amount of cases well, then it's doing something right and well. For a non inconsequential amount of people, trading bandwidth and storage for operational knowledge and complexity are a more than equitable trade

mnahkies · 2024-07-12T20:12:15 1720815135

Agreed, I remember having to vendor runtimes to my services because we couldn't risk upgrading the system installed versions with the number of things running on the box, which then led into horrible hacks with LD_PRELOAD to workaround a mixture of OS / glibc version's in the fleet. Adding another replica of anything was a pain.

Now I don't have to care what OS the host is running, or what dependencies are installed, and adding replicas is either automatic or editing a number in a config file.

Containerization and orchestration tools like k8s have made life so much easier.

nitwit005 · 2024-07-13T01:03:14 1720832594

As you note, it was all solvable before.

A lot of us were just forced to "switch" from VMs to Docker; Docker that still got deployed to a VM.

And then we got forced to switch to podman as they didn't want to pay for Docker.

9dev · 2024-07-13T07:47:56 1720856876

> As you note, it was all solvable before.

Washing clothes was possible before people had a washing machine, too; I’m not sure they would want to go back to that, though.

I was there in the VM time, and I had to set up appliances shipped as a VM instance. It was awful. The complexity around updates and hypervisors, and all that OS adjustment work just to get a runtime environment going, that just disappeared with Docker (if done right, I’ll give you that).

Organisations manage to abuse technology all the time. Remember when Roy Fielding wrote about using HTTP sensibly to transfer state from one system to another? Suddenly everything had to be „RESTful“, which for most people just meant that you tried to use as many HTTP verbs as possible and performed awkward URL gymnastics to get speaking resource identifiers. Horrible. But all of this doesn’t mean REST is a bad idea of itself - it’s a wonderful one, in fact, and can make an API substantially easier to reason about.

steve1977 · 2024-07-12T16:55:56 1720803356

I’m aware of all of that, I’m just saying that this has not translated into more reliable and better software in the end, interestingly enough. As said, I’m not blaming Docker, at least not directly. It’s more that the whole “ecosystem” around it seems to have so many disadvantages that in the end overweigh the advantages of Docker.

derefr · 2024-07-12T17:00:38 1720803638

It has translated to reliable legacy software. You can snapshot a piece of software, together with its runtime environment, at the point when it's still possible to build it; and then you can continue to run that built OCI image, with low overhead, on modern hardware — even when building the image from scratch has long become impossible due to e.g. all the package archives that the image fetched from going offline.

(And this enables some increasingly wondrous acts of software archaeology, due to people building OCI images not for preservation, but just for "use at the time" — and then just never purging them from whatever repository they've pushed them to. People are preserving historical software builds in a runnable state, completely by accident!)

Before Docker, the nearest thing you could do to this was to package software as a VM image — and there was no standard for what "a VM image" was, so this wasn't a particularly portable/long-term solution. Often VM-image formats became unsupported faster than the software held in them did!

But now, with OCI images, we're nearly to the point where we've e.g. convinced academic science to publish a paper's computational apparatus as an OCI image, so that it can be pulled 10 years later when attempting to replicate the paper.

steve1977 · 2024-07-12T17:25:15 1720805115

> You can snapshot a piece of software, together with its runtime environment, at the point when it's still possible to build it

I think you’re onto part of the problem here. The thing is that you have to snapshot a lot of nowadays software together with its runtime environment.

I mean, I can still run Windows software (for example) that is 10 years or older without that requirement.

9dev · 2024-07-12T17:27:40 1720805260

The price for that kind of backwards compatibility is a literal army of engineers working for a global megacorporation. Free software could not manage that, so having a pragmatic way to keep software running in isolated containers seems like a great solution to me.

steve1977 · 2024-07-12T17:38:10 1720805890

There’s an army of developers working on Linux as well, employed by companies like IBM and Oracle. I don’t see a huge difference to Microsoft here to be honest.

sangnoir · 2024-07-12T18:56:24 1720810584

You'd have a better time working with Windows 7 than a 2.x Linux kernel. I love Linux, but Microsoft has longer support Windows for its operating systems.

ahnick · 2024-07-12T18:25:31 1720808731

What are you even talking about? Being able to run 10 year old software (on any OS) is orthogonal to being able to build a piece software whose dependencies are completely missing. Don't pretend like this doesn't happen on Windows.

steve1977 · 2024-07-12T18:45:24 1720809924

My point was that a lot of older software, especially desktop apps, did not have such wild dependencies. Therefore this was less of an issue. Today with Python and with JavaScript and its NPM hell it is of course.

sangnoir · 2024-07-12T19:02:45 1720810965

> My point was that a lot of older software, especially desktop apps, did not have such wild dependencies. Therefore this was less of an issue.

Anyone who worked with Perl CGI and CPAN would tell you managing dependencies across environments has always been an issue. Regarding desktop software; the phrase "DLL hell" precedes NPM and pip by decades and is fundamentally the same dependency management challenge that docker mostly solves.

steve1977 · 2024-07-12T19:23:06 1720812186

DLL hell was also essentially fixed decades ago. And rarely as complex as what you see nowadays.

ahnick · 2024-07-12T19:09:53 1720811393

Exactly!

bzmrgonz · 2024-07-13T21:00:07 1720904407

I think the disconnect is in viewing your trees and not viewing the forest. Sure you were a responsible disciplined tree engineer for your acres, but what about the rest of the forest? Can we at least agree that docker made plant husbandry easier for the masses world-wide??

9dev · 2024-07-12T17:32:49 1720805569

Im not sure I would agree here: from my personal experience, the increasing containerisation has definitely nudged lots of large software projects to behave better; they don’t spew so many artifacts all over the filesystem anymore, for example, and increasingly adopt environment variables for configuration.

Additionally, I think lots of projects became able to adopt better tooling faster, since the barrier to use container-based tools is lower. Just think of GitHub Actions, which suddenly enabled everyone and their mother to adopt CI pipelines. That simply wasn’t possible before, and has led to more software adopting static analysis and automated testing, I think.

steve1977 · 2024-07-12T17:41:15 1720806075

This might all be true, but has this actually resulted in better software for end users? More stability, faster delivery of useful features? That is my concern.

watermelon0 · 2024-07-12T18:44:03 1720809843

For SaaS, I'd say it definitely improved and sped up delivery of the software from development machine to CI to production environment. How this translates to actual end users, it's totally up to the developers/DevOps/etc. of each product.

For self-hosted software, be it for business or personal use, it immensely simplified how a software package can be pulled, and run in isolated environment.

Dependency hell is avoided, and you can easily create/start/stop/delete a specific software, without affecting the rest of the host machine.

pyrale · 2024-07-12T16:33:14 1720801994

> But at the end of the day, todays Docker-based web app development delivers less than fat-client desktop app development delivered 20 years ago, as sad as that is.

You mean, aside from not having to handle installation of your software on your users' machines?

Also I'm not sure this is related to docker at all.

steve1977 · 2024-07-13T06:01:01 1720850461

I actually did work in software packaging (amongst other things) around 20 years ago. This was never a huge issue to be honest, neither was deployment.

I know, in theory this stuff all sounds very nice. With web apps, you can "deploy" within seconds ideally, compared to say at least a couple of minutes or maybe hours with desktop software distribution.

But all of that doesn't really matter if the endusers now actually have to wait weeks or months to get the features they want, because all that new stuff added so much complexity that the devs have to handle.

And that was my point. In terms of enduser quality, I don't think we have gained much, if anything at all.

supriyo-biswas · 2024-07-12T16:45:33 1720802733

Being able to create a portable artifact with only the userspace components in it, and that can be shipped and run anywhere with minimal fuss is something that didn't really exist before containers.

docandrew · 2024-07-12T16:52:57 1720803177

Java?

yjftsjthsd-h · 2024-07-12T19:16:53 1720811813

There were multiple ways to do it as long as you stayed inside one very narrow ecosystem; JARs from the JVM, Python's virtualenv, kind of PHP, I think Ruby had something? But containers gave you a single way to do it for any of those ecosystems. Docker lets you run a particular JVM with its JARs, and an exact version of the database behind that application, and the Ruby on Rails in front of it, and all these parts use the same format and commands.

bandrami · 2024-07-12T17:15:54 1720804554

25 years ago I could tell you what version of every CPAN library was in use at my company (because I installed them). What version of what libraries are the devs I support using now? I couldn't begin to tell you. This makes devs happy but I think has harmed the industry in aggregate.

twelfthnight · 2024-07-12T18:10:53 1720807853

Because of containers, my company now can roll out deployments using well defined CI/CD scripts, where we can control installations to force usage of pass-through caches (GCP artifact registry). So it actually has that data you're talking about, but instead of living in one person's head it's stored in a database and accessable to everyone via an API.

bandrami · 2024-07-13T04:53:59 1720846439

Tried that. The devs revolted and said the whole point of containers was to escape the tyranny of ops. Management sided with them, so it's the wild west there.

twelfthnight · 2024-07-13T15:02:09 1720882929

Huh. I actually can understand devs not wanting to need permission to install libraries/versions, but with a pull-through cache there's no restrictions save for security vulnerabilities.

I think it actually winds up speeding up ci/cd docker builds, too.

KronisLV · 2024-07-12T13:02:37 1720789357

> As a user experience Docker is amazing, but as technology it is hot garbage.

I mean, Podman exists, as do lots of custom build tools and other useful options. Personally, I mostly just stick with vanilla Docker (and Compose/Swarm), because it's pretty coherent and everything just fits together, even if it isn't always perfect.

Either way, agreed about the concepts behind the technology making things better for a lot of folks out there, myself included (haven't had prod issues with mismatched packages or inconsistent environments in years at this point, most of my personal stuff also runs on containers).

derefr · 2024-07-12T16:58:50 1720803530

Yeah, but the Open Container Initiative is supposed to be the responsible adults in the room taking the "fail fast" corporate Docker Inc stuff, and taking time to apply good engineering principles to it.

It's somewhat surprising that the results of that process are looking to be nearly as fly-by-the-seat-of-your-pants as Docker itself is.

belter · 2024-07-12T10:58:19 1720781899

Was it really so amazing? Here is half a Docker implementation, in about 100 lines of Bash...

https://github.com/p8952/bocker

redserk · 2024-07-12T12:34:49 1720787689

Lines of code is irrelevant.

Docker is important because:

1) it made a convenient process to build a “system” image of sorts, upload it, download it, and run it.

2) (the important bit!) Enough people adopted this process for it to become basically a standard

Before Docker, it wasnt uncommon to ship some complicated apps in VMs. Packaging those was downright awful with all of the bespoke scripting needed for the various steps of distribution. And then you get a new job? Time to learn a brand new process.

Twirrim · 2024-07-13T18:45:48 1720896348

I guess Docker has been around long enough now that people have forgotten just how much of an absolute pain it used to end up being. Just how often I'd have to repeat the joke Them: "Well, it works on my machine!" Me: "Great, back up your email, we're putting your laptop in production..."

samlinnfer · 2024-07-12T11:19:14 1720783154

The other half is the other 90%.

Looking at it now, it won't even run in the latest systemd, which now refuses to boot with cgroups v1. Good luck even accessing /dev/null under cgroups v2 with systemd.

greiskul · 2024-07-12T16:44:46 1720802686

And like the famous hacker news comment goes, Dropbox is trivial by just using FTP, curlftpfs and SVN. Docker might have many faults, but for anybody that dealt with the problems that it aimed to solve do know in that it was revolutionary in simplifying things.

And for people that disagree, please write a library like TestContainers using cobbled together bash scripts, that can download and cleanly execute and then clean up almost any common use backend dependency.

mschuster91 · 2024-07-12T09:32:15 1720776735

On top of that, it's either the OCI spec that's broken or it's just AWS being nuts, but unlike GitLab and Nexus, AWS ECR doesn't support automatically creating folders (e.g. "<acctid>.dkr.ecr.<region>.amazonaws.com/foo/bar/baz:tag"), it can only do flat storage and either have seriously long image names or tags.

Yes you can theoretically create a repository object in ECR in Terraform to mimic that behavior, but it sucks in pipelines where the result image path is dynamic - you need to give more privileges to the IAM role of the CI pipeline than I'm comfortable with, not to mention that I don't like any AWS resources managed outside of the central Terraform repository.

[1] https://stackoverflow.com/questions/64232268/storing-images-...

hanikesn · 2024-07-12T14:44:49 1720795489

That's seem standard AWS practice. Implement a new feature so you can check the box, but in practice it's a huge pain to actually use.

judge2020 · 2024-07-12T16:12:46 1720800766

It’s not as bad as Azure is (was?) with IPv6: https://news.ycombinator.com/item?id=29327773

X-Istence · 2024-07-12T17:38:59 1720805939

Azure's IPv6 implementation is still flawed and still broken. That has not changed.

xyzzy_plugh · 2024-07-12T18:20:12 1720808412

IIRC it's not in the spec because administration of resources is out of scope. For example, perhaps you offer a public repository and you want folks to sign up for an account before they can push? Or you want to have an approval process before new repositories are created?

Regardless it's a huge pain that ECR doesn't support this. Everybody I know of who has used ECR has run into this.

There's a long standing issue open which I've been subscribed to for years now: https://github.com/aws/containers-roadmap/issues/853

kbumsik · 2024-07-12T10:50:49 1720781449

Actually, Cloudflare open-sourced a container registry server using R2.[1]

Anyone tried it?

[1]: https://github.com/cloudflare/serverless-registry

justin_oaks · 2024-07-12T15:06:17 1720796777

Looks cool. Thanks for linking it.

It does mention that it's limited to 500MB per layer.

For some people's use case that limitation might not be a big deal, but for others that's a dealbreaker.

danesparza · 2024-07-12T19:08:07 1720811287

From the README:

* Pushing with docker is limited to images that have layers of maximum size 500MB. Refer to maximum request body sizes in your Workers plan.

* To circumvent that limitation, you can manually add the layer and the manifest into the R2 bucket or use a client that is able to chunk uploads in sizes less than 500MB (or the limit that you have in your Workers plan).

wofo · 2024-07-12T07:50:06 1720770606

Hi HN, author here. If anyone knows why layer pushes need to be sequential in the OCI specification, please tell! Is it merely a historical accident, or is there some hidden rationale behind it?

Edit: to clarify, I'm talking about sequentially pushing a _single_ layer's contents. You can, of course, push multiple layers in parallel.

abofh · 2024-07-12T16:43:16 1720802596

It makes clean-up simpler - if you never got to the "last" one, it's obvious you didn't finish after N+Timeout and thus you can expunge it. It simplifies an implementation detail (how do you deal with partial uploads? make them easy to spot). Otherwise you basically have to trigger at the end of every chunk, see if all the other chunks are there and then do the 'completion'.

But that's an implementation detail, and I suspect isn't one that's meaningful or intentional. Your S3 approach should work fine btw, I've done it before in a prior life when I was at a company shipping huge images and $.10/gb/month _really_ added up.

You lose the 'bells and whistles' of ECR, but those are pretty limited (imho)

orf · 2024-07-12T18:27:51 1720808871

In the case of a docker registry, isn’t the “final bit” just uploading the final manifest that actually references the layers you’re uploading?

At this point you’d validate that the layers exist and have been uploaded, otherwise you’d just bail out?

And those missing chunks would be handled by the normal registry GC, which evicts unreferenced layers?

abofh · 2024-07-13T09:02:45 1720861365

It's been a long time, but I think you're correct. In my environment I didn't actually care (any failed push would be retried so the layers would always eventually complete, and anything that for whatever reason didn't retry, well, it didn't happen enough that we cared at the cost of S3 to do anything clever).

I think OCI ordered manifests first to "open the flow", but then close is only when the manifests last entry was completed - which led to this ordered upload problem.

If your uploader knows where the chunks are going to live (OCI is more or less CAS, so it's predictable), it can just put them there in any order as long as it's all readable before something tries to pull it.

rcarmo · 2024-07-12T08:01:56 1720771316

Never dealt with pushes, but it’s nice to see this — back when Docker was getting started I dumped an image behind nginx and pulled from that because there was no usable private registry container, so I enjoyed reading your article.

majewsky · 2024-07-16T11:53:46 1721130826

Source: I have implemented a OCI-compliant registry [1], though for the most part I've been following the behavior of the reference implementation [2] rather than the spec, on account of its convolutedness.

When the client finalizes a blob upload, they need to supply the digest of the full blob. This requirement evidently serves to enable the server side to validate the integrity of the supplied bytes. If the server only started checking the digest as part of the finalize HTTP request, it would have to read back all the blob contents that had already been written into storage in previous HTTP requests. For large layers, this can introduce an unreasonable delay. (Because of specific client requirements, I have verified my implementation to work with blobs as large as 150 GiB.)

Instead, my implementation runs the digest computation throughout the entire sequence of requests. As blob data is taken in chunk by chunk, it is simultaneously streamed into the digest computation and into blob storage. Between each request, the state of the digest computation is serialized in the upload URL that is passed back to the client in the Location header. This is roughly the part where it happens in my code: https://github.com/sapcc/keppel/blob/7e43d1f6e77ca72f0020645...

I believe that this is the same approach that the reference implementation uses. Because digest computation can only work sequentially, therefore the upload has to proceed sequentially.

[1] https://github.com/sapcc/keppel [2] https://github.com/distribution/distribution

codethief · 2024-07-12T16:38:25 1720802305

Hi, thanks for the blog post!

> For the last four months I’ve been developing a custom container image builder, collaborating with Outerbounds

I know you said this was something for another blog post but could you already provide some details? Maybe a link to a GitHub repo?

Background: I'm looking for (or might implement myself) a way to programmatically build OCI images from within $PROGRAMMING_LANGUAGE. Think Buildah, but as an API for an actual programming language instead of a command line interface. I could of course just invoke Buildah as a subprocess but that seems a bit unwieldy (and I would have to worry about interacting with & cleaning up Buildah's internal state), plus Buildah currently doesn't support Mac.

throwawaynorway · 2024-07-12T20:54:23 1720817663

If $PROGRAMMING_LANGUAGE = go, you might be looking for https://github.com/containers/storage which can create layers, images, and so on. I think `Store` is the main entry: https://pkg.go.dev/github.com/containers/storage#Store

Buildah uses it: https://github.com/containers/buildah/blob/main/go.mod#L27C2...

Edit: buildkit seems to be the same, used by docker, but needs a daemon?

wofo · 2024-07-12T18:36:15 1720809375

Unfortunately, all the code is proprietary at the moment. If you are willing to get your hands dirty, the main thing to realize is that container layers are "just" tar files (see, for instance, this article: https://ochagavia.nl/blog/crafting-container-images-without-...). Contact details are in my profile, in case you'd like to chat ;)

codethief · 2024-07-12T18:56:42 1720810602

Ah too bad :)

Thanks for the link! Though I'm less worried about the tarball / OCI spec part, more about platform compatibility. I tried running runc/crun by hand at some point and let's just say I've done things before that were more fun. :)

cpuguy83 · 2024-07-12T20:47:38 1720817258

This is what buildkit is. Granted go has the only sdk I know of, but the api is purely protobuf and highly extensible.

cpuguy83 · 2024-07-12T20:49:34 1720817374

For that matter, dagger (dagger.io) provides an sdk in multiple languages and gives you the full power (and then some extra on top) of buildkit.

IanCal · 2024-07-12T08:42:08 1720773728

I can't think of an obvious one, maybe load based?

~~I added parallel pushes to docker I think, unless I'm mixing up pulls & pushes, it was a while ago.~~ My stuff was around parallelising the checks not the final pushes.

Edit - does a layer say which layer it goes "on top" of? If so perhaps that's the reason, so the IDs of what's being pointed to exist.

wofo · 2024-07-12T08:47:49 1720774069

Layers are fully independent of each other in the OCI spec (which makes them reusable). They are wired together through a separate manifest file that lists the layers of a specific image.

It's a mystery... Here are the bits of the OCI spec about multipart pushes (https://github.com/opencontainers/distribution-spec/blob/58d...). In short, you can only upload the next chunk after the previous one finishes, because you need to use information from the response's headers.

IanCal · 2024-07-12T08:51:12 1720774272

Ah thanks.

That's chunks of a single layer though, not multiple layers right?

wofo · 2024-07-12T08:53:36 1720774416

Indeed, you are free to push multiple layers in parallel. But when you have a 1 GiB layer full of AI/ML stuff you can feel the pain!

(I just updated my original comment to make clear I'm talking about single-layer pushes here)

killingtime74 · 2024-07-12T09:54:28 1720778068

Split the layer up?

thangngoc89 · 2024-07-12T10:52:08 1720781528

You can’t. Installing pytorch and supporting dependencies takes 2.2GB on debian-slim.

electroly · 2024-07-12T15:08:31 1720796911

If you've got plenty of time for the build, you can. Make a two-stage build where the first stage installs Python and pytorch, and the second stage does ten COPYs which each grab 1/10th of the files from the first stage. Now you've got ten evenly sized layers. I've done this for very large images (lots of Python/R/ML crap) and it takes significant extra time during the build but speeds up pulls because layers can be pulled in parallel.

thangngoc89 · 2024-07-13T13:37:40 1720877860

I see your point on the pull speed. Most of my pulls are stuck at waiting for the pytorch/dependencies layer.

This might work with pip but I absolutely hate pip and using poetry with great success. I will investigate how to do this with poetry.

fweimer · 2024-07-12T11:10:23 1720782623

Surely you can have one layer per directory or something like that? Splitting along those lines works as long as everything isn't in one big file.

I think it was a mistake to make layers as a storage model visible in to the end user. This should just have been an internal implementation detail, perhaps similar to how Git handles delta compression and makes it independent of branching structure. We also should have delta pushes and pulls, using global caches (for public content), and the ability to start containers while their image is still in transfer.

password4321 · 2024-07-12T11:14:18 1720782858

It should be possible to split into multiple layers as long as each file is wholly within in its layer. This is the exact opposite of the work recommended combining commands to keep everything in one layer which I think is done ultimately for runtime performance reasons.

ramses0 · 2024-07-12T14:06:10 1720793170

I've dug fairly deep into docker layering, it would be wonderful if there was a sort of `LAYER ...` barrier instead of implicitly via `RUN ...` lines.

Theoretically there's nothing stopping you from building the docker image and "re-layering it", as they're "just" bundles of tar files at the end of the day.

eg: `RUN ... ; LAYER /usr ; LAYER /var ; LAYER /etc ; LAYER [discard|remainder]`

yjftsjthsd-h · 2024-07-12T15:37:42 1720798662

I've wished for a long time that Dockerfiles had an explicit way to define layers ripped off from (postgre)sql:

    BEGIN
    RUN foo
    RUN bar
    COMMIT

mdaniel · 2024-07-12T16:11:57 1720800717

At the very real risk of talking out of my ass, the new versioned Dockerfile mechanism on top of builtkit should enable you to do that: https://github.com/moby/buildkit/blob/v0.15.0/frontend/docke...

In true "when all you have is a hammer" fashion, as very best I can tell that syntax= directive is pointing to a separate docker image whose job it is to read the file and translate it into builtkit api calls, e.g. https://github.com/moby/buildkit/blob/v0.15.0/frontend/docke...

But, again for clarity: I've never tried such a stunt, that's just the impression I get from having done mortal kombat with builtkit's other silly parts

skrause · 2024-07-12T17:23:49 1720805029

    RUN <<EOF
    foo
    bar
    EOF

https://www.docker.com/blog/introduction-to-heredocs-in-dock...

yjftsjthsd-h · 2024-07-12T19:06:29 1720811189

Thanks, that helps a lot and I didn't know about it:) It's a touch less powerful than full transactions (because AFAICT you can't say merge a COPY and RUN together) but it's a big improvement.

KronisLV · 2024-07-12T12:59:24 1720789164

That's a pretty cool use case!

Personally, I just use Nexus because it works well enough (and supports everything from OCI images to apt packages and stuff like a custom Maven, NuGet, npm repo etc.), however the configuration and resource usage both are a bit annoying, especially when it comes to cleanup policies: https://www.sonatype.com/products/sonatype-nexus-repository

That said:

> More specifically, I logged the requests issued by docker pull and saw that they are “just” a bunch of HEAD and GET requests.

this is immensely nice and I wish more tech out there made common sense decisions like this, just using what has worked for a long time and not overcomplicating.

I am a bit surprised that there aren't more simple container repositories out there (especially with auth and cleanup support), since Nexus and Harbor are both a bit complex in practice.

thangngoc89 · 2024-07-13T17:03:11 1720890191

I’m using Gitea just for packages (Docker, npm, python, ….) I’m surprised noone has mentioned this in this thread

akeck · 2024-07-12T15:30:29 1720798229

Note that CNCF's Distribution (formerly Docker's Registry) includes support for backing a registry with Cloudfront signed URLs that pull from S3. [1]

https://distribution.github.io/distribution/storage-drivers/...

rad_gruchalski · 2024-07-12T09:55:14 1720778114

What’s wrong with https://github.com/distribution/distribution?

_flux · 2024-07-12T10:26:10 1720779970

I hadn't seen that before, and it indeed does support S3, but does it also offer the clients the downloads directly from S3, or does it merely use it as its own storage backend (so basically work as a proxy when pulling)?

vbezhenar · 2024-07-12T13:18:10 1720790290

It redirects client requests to S3 endpoint, so yes, in the end all heavy traffic goes from S3.

est31 · 2024-07-13T02:03:38 1720836218

But it still means a downtime of your service is a downtime for anything which might need a docker container, whereas if it went to S3 (or cloudfront in front of S3) directly, you profit from the many nines that S3 offers without paying an arm and a leg for ECR (data in ECR costs five times as much as S3 standard tier).

lofties · 2024-07-12T10:29:57 1720780197

This sounds very, very expensive, and I would've loved to see cost mentioned in the article too. (for both S3 and R2)

est31 · 2024-07-13T02:10:19 1720836619

S3's standard tier costs a fifth of ECR in terms of costs per GB stored. Egress costs to the free internet are the same, with the exception that for public ECR repositories they make egress to inner-AWS usage free.

remram · 2024-07-12T13:54:36 1720792476

The cost is the S3 cost though. It depends on region and storage tier, but the storage cost per GB, the GET/PUT cost, and the bandwidth cost can be found on the AWS website: https://aws.amazon.com/s3/pricing/

donatj · 2024-07-12T11:18:03 1720783083

I don't do a ton with Docker outside dev tooling, but I have never understood why private container registries even exist? It just smells like rent seeking. What real advantage does it provide over say just generating some sort of image file you manage yourself, as you please?

figmert · 2024-07-12T11:38:40 1720784320

You don't have to use it. You can use docker save and docker import:

    docker save alpine:3.19 > alpine.tar
    docker load < alpine.tar

But now I have to manage that tar file, have all my systems be aware of where it is, how to access it, etc. Or, I could just not re-invent the wheel and use what docker already has provided.

JackSlateur · 2024-07-12T11:38:25 1720784305

You will probably have images that you will not share to the world. Said images will probably be made available to your infrastructure (k8s clusters, CI/CD runners etc). So you have to either build your own registry or pay someone to do it for you.

Of course, if you use images for dev only, all of that are worthless and you just store your images on your dev machine

regularfry · 2024-07-12T16:18:58 1720801138

Also if your infrastructure is within AWS, you want your images to also be within AWS when the infrastructure wants them. That doesn't necessarily imply a private registry, but it's a lot less work that way.

vel0city · 2024-07-12T17:06:45 1720804005

Why have a code repository instead of just emailing files around?

Because you want a central store someplace with all the previous versions that is easily accessible to lots of consumers.

I don't want to build my app and then have to push it to every single place that might run it. Instead, I'll build it and push it to a central repo and have everything reference that repo.

> It just smells like rent seeking.

You don't need to pay someone to host a private repo for you. There are lots of tools out there so you can self-host.

mcraiha · 2024-07-12T11:56:58 1720785418

Private (cloud) registries are very useful when there are mandatory AuthN/AuthZ things in the project related to the docker images. You can terraform/bicep/pulumi everything per environment.

arccy · 2024-07-12T13:57:54 1720792674

and how do you manage them? you use the same tooling that exists for all public images by running a container registry.

alemanek · 2024-07-12T16:34:25 1720802065

Integration with vulnerability scanning utilities and centralized permissions for orgs are nice benefits.

danmur · 2024-07-12T11:45:06 1720784706

Companies send young engineers (and older engineers who should know more but don't) to AWS and Microsoft for "cloud certification". They learn how to operate cloud services because thats what benefits AWS and MS, so thats what their solutions use.

It's a difficult uphill battle to get people interested in how things work under the hood, which is what you need in order to know you can do things like easily host your own package repositories.

figmert · 2024-07-12T12:13:41 1720786421

This is a odd assessment. I agree certifications aren't all that, but having people learn them isn't about that. It's more that people don't feel like reinventing the wheel at every company, so they can focus on the real work, like shipping the application they've written. So companies like AWS, Docker etc, write things, abstract things away, so someone else doesn't have to redo the whole thing.

Yes I can host my packages and write tooling around it to make it easy. But JFrog already has all the tooling around it, and it integrates with current tooling. Why would I write the whole thing again?

danmur · 2024-07-12T13:48:41 1720792121

I am responding to this part of the parent comment:

> I don't do a ton with Docker outside dev tooling, but I have never understood why private container registries even exist?

You know the options and have made a conscious choice:

> Yes I can host my packages and write tooling around it to make it easy. But JFrog already has all the tooling around it, and it integrates with current tooling. Why would I write the whole thing again?

So presumably you are not the kind of people I was talking about.

EDIT: I'm also assuming by the rent seeking part that the parent is referring to paid hosted services like ECR etc.

watermelon0 · 2024-07-12T18:55:26 1720810526

It seems that ECR is actually designed in a way to support uploading image layers in multiple parts.

Related ECR APIs:

- InitiateLayerUpload API: called at the beginning of upload of each image layer

- UploadLayerPart API: called for each layer chunk (up to 20 MB)

- PutImage API: called after layers are uploaded, to push image manifest, containing references to all image layers

The only weird thing seems to be that you have to upload layer chunks in base64 encoding, which increases data for ~33%.

wofo · 2024-07-15T08:47:06 1721033226

I tried using that API directly, but it unfortunately did require ordered uploads too :')

phillebaba · 2024-07-12T11:08:14 1720782494

Interesting idea to use the file path layout as a way to control the endpoints.

I do wonder though how you would deal with the Docker-Content-Digest header. While not required it is suggested that responses should include it as many clients expect it and will reject layers without the header.

Another thing to consider is that you will miss out on some feature from the OCI 1.1 spec like the referrers API as that would be a bit tricky to implement.

8organicbits · 2024-07-12T11:15:07 1720782907

> that S3 is up to 8x faster than ECR

Awesome. Developer experience is so much better when CI doesn't take ages. Every little bit counts.

barbazoo · 2024-07-12T16:11:54 1720800714

> ECR 24 MiB/s (8.2 s)

> S3 115 MiB/s (1.7 s)

It's great that it's faster but absolutely, it's only an improvement of 6.5s observed, as you said, on the CI server. And it means using something for a purpose that it's not intended for. I'd hate to have to spend time debugging this if it breaks for whatever reason.

wofo · 2024-07-12T18:40:46 1720809646

To be clear, the 8x was comparing the slowest ECR throughput measurement against the fastest S3 one. In any case, the improvement is significant.

cpa · 2024-07-12T08:07:00 1720771620

Is there a good reason for not allowing parallel uploads in the spec?

wofo · 2024-07-12T08:12:43 1720771963

No idea... I asked the same question here (https://news.ycombinator.com/item?id=40943480) and am hoping we'll have a classic HN moment where someone who was involved in the design of the spec will chime in.

benterix · 2024-07-12T08:12:46 1720771966

I believe that even if there was one then, it's probably no longer valid and it's now just a performance limitation.

wofo · 2024-07-12T08:22:49 1720772569

Other than backwards-compatibility, I can imagine simplicity being a reason. For instance, sequential pushing makes it easier to calculate the sha256 hash of the layer as it's being uploaded, without having to do it after-the-fact when the uploaded chunks are assembled.

amluto · 2024-07-12T09:46:08 1720777568

The fact that layers are hashed with SHA256 is IMO a mistake. Layers are large, and using SHA256 means that you can’t incrementally verify the layer as you download it, which means that extreme care would be needed to start unpacking a layer while downloading it. And SHA256 is fast but not that fast, whereas if you really feel like downloading in parallel, a hash tree can be verified in parallel.

A hash tree would have been nicer, and parallel uploads would have been an extra bonus.

cpuguy83 · 2024-07-12T20:45:49 1720817149

sha256 has been around a long time and is highly compatible.

blake3 support has been proposed both in the OCI spec and in the runtimes, which at least for runtimes I expect to happen soon.

I tend to think gzip is the bigger problem, though.

amluto · 2024-07-12T22:36:31 1720823791

> sha256 has been around a long time and is highly compatible.

Sure, and one can construct a perfectly nice tree hash from SHA256. (AWS Glacier did this, but their construction should not be emulated.)

cpuguy83 · 2024-07-13T00:24:49 1720830289

But then every single client needs to support this. sha256 support is already ubiquitous.

amluto · 2024-07-13T01:07:19 1720832839

Every single client already had to implement enough of the OCI distribution spec to be able to parse and download OCI images. Implementing a more appropriate hash, which could be done using SHA-256 as a primitive, would have been a rather small complication. A better compression algorithm (zstd?) is far more complex.

cpuguy83 · 2024-07-13T14:07:39 1720879659

I don't think we can compare reading json to writing a bespoke, secure hashing algorithm across a broad set of languages.

amluto · 2024-07-14T00:43:44 1720917824

Reading JSON that contains a sort of hash tree already. It’s a simple format that contains a mess of hashes that need verifying over certain files.

Adding a rule that you hash the files in question in, say, 1 MiB chunks and hash the resulting hashes (and maybe that’s it, or maybe you add another level) is maybe 10 lines of code in any high level language.

oconnor663 · 2024-07-14T16:29:23 1720974563

Note that secure tree hashing requires a distinguisher between the leaves and the parents (to avoid collisions) and ideally another distinguisher between the root and everything else (to avoid extensions). Surprisingly few bespoke tree hashes in the wild get this right.

amluto · 2024-07-14T23:15:14 1720998914

This is why I said that Glacier’s hash should not be emulated.

FWIW, using a (root hash, data length) pair hides many sins, although I haven’t formally proven this. And I don’t think that extension attacks are very relevant to the OCI use case.

catlifeonmars · 2024-07-12T08:28:22 1720772902

That does not make any sense; as the network usually is a much bigger bottleneck than compute, even with disk reads. You’re paying quite a lot for “simplicity” if that were the case

jtmarmon · 2024-07-12T08:32:07 1720773127

I’m no expert on docker but I thought the hashes for each layer would already be computed if your image is built

cpuguy83 · 2024-07-12T20:40:16 1720816816

It's complicated. If you are using the containerd backed image store (opt-in still) OR if you push with "build --push" then yes.

The default storage backend does not keep compressed layers, so those need to be recreated and digested on push.

With the new store all that stuff is kept and reused.

wofo · 2024-07-12T08:34:36 1720773276

That's true, but I'd assume the server would like to double-check that the hashes are valid (for robustness / consistency)... That's something my little experiment doesn't do, obviously.

michaelmior · 2024-07-12T13:42:55 1720791775

> Why can’t ECR support this kind of parallel uploads? The “problem” is that it implements the OCI Distribution Spec…

I don't see any reason why ECR couldn't support parallel uploads as an optimization. Provide an alternative to `docker push` for those who care about speed that doesn't conform to the spec.

wofo · 2024-07-12T13:50:02 1720792202

Indeed, they could support it through a non-standard API... I wish they did!

champtar · 2024-07-13T00:27:43 1720830463

What I would really love is for the OCI Distribution spec to support just static files, so we can use dumb http servers directly, or even file:// (for pull). All the metadata could be/is already in the manifests, having Content-Type: octet-stream could work just fine.

victorbjorklund · 2024-07-12T09:57:56 1720778276

But this only works for public repos right? I assume docker pull wont use a s3 api key

wofo · 2024-07-12T11:11:29 1720782689

That's true, unfortunately. I'm thinking about ways to somehow support private repos without introducing a proxy in between... Not sure if it will be possible.

victorbjorklund · 2024-07-13T06:29:45 1720852185

Maybe a very thin proxy layer on like cloudflare functions.

wofo · 2024-07-13T07:21:36 1720855296

Yep, I originally thought that wouldn't work... But now I discovered (thanks to a comment here) that the registry is allowed to return an HTTP redirect instead of serving the layer blobs directly... Which opens new possibilities :)

kevin_nisbet · 2024-07-12T18:27:59 1720808879

It's cool to see it, I was interested in trying something similar a couple years ago but priorities changed.

My interest was mainly around a hardening stand point. The base idea was the release system through IAM permissions would be the only system with any write access to the underlying S3 bucket. All the public / internet facing components could then be limited to read only access as part of the hardening.

This would of course be in addition to signing the images, but I don't think many of the customers at the time knew anything about or configured any of the signature verification mechanisms.

tealpod · 2024-07-12T11:37:11 1720784231

This is such a wonderful idea, congrats.

There is a real usecase for this in some high security sectors. I can't put complete info here for the security reasons, let me know if you are interested.

lazy_moderator1 · 2024-07-12T14:07:21 1720793241

That's neat! On that note I've been using S3 as a private registry for years now via Gitlab and couldn't be happier!

jaimehrubiks · 2024-07-12T08:21:04 1720772464

I experience everyday the slowness of pushing big images (ai related tend to be big) to ECR on our cicd.

wofo · 2024-07-12T08:24:17 1720772657

I wonder whether the folks at Cloudflare could take the ideas from the blog post and create a high-performance serverless container registry based on R2. They could call it scrubs, for "serverless container registry using blob storage" :P

zokier · 2024-07-12T09:36:49 1720777009

I just point out that AWS ECR is exactly that, thin wrapper around S3. See for example https://docs.aws.amazon.com/AmazonECR/latest/APIReference/AP...

slig · 2024-07-12T10:42:45 1720780965

Hopefully someone gets inspired by this and implements a thin wrapper using CF workers.

slig · 2024-07-12T12:43:26 1720788206

Hopefully someone gets inspired by this and implements a thin wrapper using CF workers.

Edit: CF already did it https://github.com/cloudflare/serverless-registry

wofo · 2024-07-12T11:13:07 1720782787

I didn't expect that! It's a pity they don't expose an API for parallel uploads, for those of us who need to maximize throughput and don't mind using something non-standard.

sshine · 2024-07-12T08:55:43 1720774543

https://www.youtube.com/watch?v=3QbOssRq0Gs

  [Chorus: Chilli & T-Boz]
  No, I don't want no scrub
  A scrub is a guy that can't get no love from me
  Hangin' out the passenger side of his best friend's ride
  Trying to holla at me
  I don't want no scrub
  A scrub is a guy that can't get no love from me
  Hangin' out the passenger side of his best friend's ride
  Trying to holla at me

iampims · 2024-07-12T12:50:27 1720788627

You mean this?

A Docker registry backed by Workers and R2.

https://github.com/cloudflare/serverless-registry

wofo · 2024-07-12T12:54:23 1720788863

Ah, that's great! I'll have to look into it :)

dheera · 2024-07-13T05:06:46 1720847206

Make sure you use HTTPS, or someone could theoretically inject malicious code into your container. If you want to use your own domain you'll have to use CloudFront to wrap S3 though.

wofo · 2024-07-15T08:48:36 1721033316

That's what I thought originally, but you can actually use `https://<your-bucket>.s3.amazonaws.com` without Cloudfront or other service on top (it wasn't easy to find in the AWS docs, but it works)

ericfrederich · 2024-07-12T12:37:40 1720787860

R2 in only "free" until it isn't. Cloudflare hasn't got a lot of good press recently. Not something I'd wanna build my business around.

TheMrZZ · 2024-07-12T13:40:38 1720791638

Aside from the casino story (high value target that likely faces tons of attacks, therefore an expensive customer for CF), did something happen with them? I'm not aware of bad press around them in general

jgrahamc · 2024-07-12T13:08:53 1720789733

R2 egress is free.

fnord77 · 2024-07-12T19:53:42 1720814022

> What makes S3 faster than ECR?

the author is missing something huge - ECR does a security scan on upload, too.

ericpauley · 2024-07-12T10:26:50 1720780010

Where's the source code?

wofo · 2024-07-12T11:07:10 1720782430

The source code is proprietary, but it shouldn't take much work to replicate, fortunately (you just need to upload files at the right paths).

seungwoolee518 · 2024-07-12T13:19:52 1720790392

Like treat path as a object key, and put value as a json or blob?

filleokus · 2024-07-12T11:40:08 1720784408

I've started to grow annoyed with container registry cloud products. Always surprisingly cumbersome to auto-delete old tags, deal with ACL or limit the networking.

It would be nice if a Kubernetes distro took a page out of the "serverless" playbook and just embedded a registry. Or maybe I should just use GHCR

breatheoften · 2024-07-12T14:28:11 1720794491

I'm using google's artifact registry -- aside from upload speed another thing that kills me is freakin download speed ... Why in the world should it take 2 minutes to download a 2.6 GB layer to a cloud build instance sitting in the same region as the artifact registry ... Stupidly slow networking really harms the stateless ci machine + docker registry cache which actually would be quite cool if it was fast enough ...

In my case it's still faster than doing the builds would be -- but I'm definitely gonna have to get machines with persistent local cache in the mix at some point so that these operations will finish within a few seconds instead of a few minutes ...

kevin_nisbet · 2024-07-12T19:48:15 1720813695

We did this in the Gravity Kubernetes Distribution (which development is shut down), but we had to for the use case. Since the distribution was used to take kubernetes applications behind the firewall with no internet access we needed the registry... and it was dead simple just running the docker-distribution registry on some of the nodes.

In theory it wouldn't be hard to just take docker-distribution and run it as a pod in the cluster with an attached volume if you wanted a registry in the cluster. So it's probably somewhere between trivial and takes a bit of effort if you're really motivated to have something in cluster.

freerc1347 · 2024-07-13T02:17:21 1720837041

Have you tried zot? https://www.cncf.io/projects/zot/

https://zotregistry.dev/

Here are all the projects already using zot in some form or another.

https://github.com/project-zot/zot/issues/2117

vbezhenar · 2024-07-12T13:19:46 1720790386

Kubernetes is extremely bare-bones, there's no way they'll embed a registry. Kubernetes doesn't touch images at all, AFAIK, it delegates that to the container runtime, e.g. containerd.

If you want some lightweight registry, use "official" docker registry. I'm running it inside Kubernetes and it consumes it just fine.

mdaniel · 2024-07-12T16:20:44 1720801244

> Always surprisingly cumbersome to auto-delete old tags,

Does this not do what you want? https://docs.aws.amazon.com/AmazonECR/latest/userguide/lifec...

I can't speak to the other "registry cloud products" except for GitLab, which is its own special UX nonsense, but they also support expiry after enough whisky consumption