This lazy, constant pulling of dependencies by CI systems and containers is not very sustainable. PyPI should set up limits and make people use some cache proxy.
I worked at a place (briefly) where the CI process was pulling down 3-4GiB of container images for every run. Same repos, same sites, every check-in on every branch pushed to GitHub set up a blank environment and pulled everything down afresh. Then the build process yoinked dozens of packages, again with no cache. It must have consumed terabytes a day. Madness!
I am curious about how such a cache could be set up reliably.
1. A proxy inside the company network that intercepts every request to PyPI? Doesn't work well due to HTTPS, I guess.
2. Replacing PyPI in your project configuration with your own mirror? Might work, but it at least slows down every build outside of your network. Also needs to be replicated for npm, cargo, maven, docker, ...
3. Starting a userspace wrapper before the build that transparently redirects every request. That would be the best solution, IMO. But how do I intercept HTTPS requests of child processes? Technically, it must be possible, but is there such a tool?
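For option 3, the closest thing I can picture is not truly transparent interception but pointing every child process at an explicit TLS-terminating caching proxy through environment variables, roughly like this sketch (the proxy address and CA path are placeholders; assumes something like mitmproxy or a corporate proxy whose CA the tools are told to trust):

    # hypothetical caching proxy on proxy.internal:8080, its CA exported to /etc/ci/proxy-ca.pem
    export HTTP_PROXY=http://proxy.internal:8080
    export HTTPS_PROXY=http://proxy.internal:8080
    # each tool must trust the proxy's CA, or it will (rightly) refuse the re-signed certs
    export PIP_CERT=/etc/ci/proxy-ca.pem              # pip
    export REQUESTS_CA_BUNDLE=/etc/ci/proxy-ca.pem    # anything built on requests
    export NODE_EXTRA_CA_CERTS=/etc/ci/proxy-ca.pem   # node/npm
    export SSL_CERT_FILE=/etc/ci/proxy-ca.pem         # many OpenSSL-based tools
    ./run-build.sh                                     # child processes inherit the environment

It only works for tools that honour the proxy variables, and each one needs to be told about the CA, so it's not fully transparent, but it doesn't require touching the build definitions themselves.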
> Also needs to be replicated for npm, cargo, maven, docker, ...
All these ecosystems have tooling for running an internal repo/mirror.
There are also commercial offerings (like Artifactory) that cover all these languages in a single tool.
For Python, just set PIP_INDEX_URL in the CI environment to point to the internal mirror and it will be picked up automatically. It's very easy.
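For instance, assuming a GitLab-style CI file and a hypothetical internal mirror URL, it's just:

    # .gitlab-ci.yml (or the equivalent env block in any CI system)
    variables:
      PIP_INDEX_URL: "https://pypi-mirror.internal.example/simple/"

    # or plain shell, exported before the build steps run
    export PIP_INDEX_URL=https://pypi-mirror.internal.example/simple/

Every pip invocation in the job picks it up without touching requirements files.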
By default the downloaded wheels are cached under ~/.cache/pip (on Linux). Isolated builds don't benefit from this if they start with a fresh filesystem on every run; consider mounting a writable directory to use as the cache, the speedup is more than worth it.
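For a containerised build that could look something like this (the image name and host path are placeholders; inside most build containers pip's cache ends up at /root/.cache/pip):

    # mount a persistent host directory over pip's cache inside the build container
    docker run --rm \
      -v /srv/ci-cache/pip:/root/.cache/pip \
      my-build-image \
      pip install -r requirements.txt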
Just so. It does take a little configuration. The system I was talking about had been band-aided since the company's two-person days, when I'm sure the builds were a fraction of the size. Good example of infra tech debt.
pip can be configured to use a local warehouse, and there are warehouses that transparently proxy to PyPI while caching any previously fetched result, e.g. https://pypi.org/project/proxypypi/
Since you control it, and it's read only from the outside, you can actually expose it even outside of your network.
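For pip itself, the proxy just needs to be set as the index, e.g. system-wide (the URL is a placeholder):

    # /etc/pip.conf (or ~/.config/pip/pip.conf)
    [global]
    index-url = https://pypi-cache.internal.example/simple/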
But indeed, it must be replicated for npm, cargo, maven, docker...
Setting up your own proxy that covers pretty much everything (maven/npm/nuget/pypi/gems/docker/etc.) is not difficult and takes only a few hours of work. I went for Sonatype Nexus, but many (most?) cases can even be covered with an nginx caching proxy.
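A rough sketch of the nginx variant for PyPI (server name, cert handling and cache sizes are placeholders; this only caches the package downloads, the simple index still comes from pypi.org unless you proxy that too):

    proxy_cache_path /var/cache/nginx/pypi levels=1:2 keys_zone=pypi:10m
                     max_size=50g inactive=30d use_temp_path=off;

    server {
        listen 443 ssl;
        server_name pypi-cache.internal.example;
        # ssl_certificate / ssl_certificate_key omitted

        location / {
            proxy_pass https://files.pythonhosted.org;
            proxy_set_header Host files.pythonhosted.org;
            proxy_ssl_server_name on;
            proxy_cache pypi;
            proxy_cache_valid 200 30d;
        }
    }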
AIUI, PyPI doesn't _actually_ cost that much to host: it's all donated/sponsored hosting (with data largely served from Fastly), so the "cost" figures I'd expect to see are really "what it would cost if we actually had to pay the normal published prices".