This lazy, constant pulling of dependencies by CI systems and containers is not very sustainable. PyPI should set up limits and make people use some cache proxy.
I worked at a place (briefly) where the CI process was pulling down 3-4GiB of container images for every run. Same repos, same sites, every check-in on every branch pushed to GitHub set up a blank environment and pulled everything down afresh. Then the build process yoinked dozens of packages, again with no cache. It must have consumed terabytes a day. Madness!
I am curious about how such a cache could be set up reliably.
1. A proxy inside the company network that intercepts every request to PyPI? Doesn't work well due to HTTPS, I guess.
2. Replacing PyPI in your project configuration with your own mirror? Might work, but it at least slows down every build outside of your network. Also needs to be replicated for npm, cargo, maven, docker, ...
3. Starting a userspace wrapper before the build that transparently redirects every request. That would be the best solution, IMO. But how do I intercept HTTPS requests of child processes? Technically, it must be possible, but is there such a tool?
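For option 3, the closest thing I can picture is not truly transparent interception but pointing every child process at an explicit TLS-terminating caching proxy through environment variables, roughly like this sketch (the proxy address and CA path are placeholders; assumes something like mitmproxy or a corporate proxy whose CA the tools are told to trust):

    # hypothetical caching proxy on proxy.internal:8080, its CA exported to /etc/ci/proxy-ca.pem
    export HTTP_PROXY=http://proxy.internal:8080
    export HTTPS_PROXY=http://proxy.internal:8080
    # each tool must trust the proxy's CA, or it will (rightly) refuse the re-signed certs
    export PIP_CERT=/etc/ci/proxy-ca.pem              # pip
    export REQUESTS_CA_BUNDLE=/etc/ci/proxy-ca.pem    # anything built on requests
    export NODE_EXTRA_CA_CERTS=/etc/ci/proxy-ca.pem   # node/npm
    export SSL_CERT_FILE=/etc/ci/proxy-ca.pem         # many OpenSSL-based tools
    ./run-build.sh                                     # child processes inherit the environment

It only works for tools that honour the proxy variables, and each one needs to be told about the CA, so it's not fully transparent, but it doesn't require touching the build definitions themselves.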
> Also needs to be replicated for npm, cargo, maven, docker, ...
All these ecosystems have tooling for running an internal repo/mirror.
There are also commercial offerings (like Artifactory) that cover all these languages in a single tool.
For Python, just set PIP_INDEX_URL in the CI environment to point to the internal mirror and it will be picked up automatically. It's very easy.
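For instance, assuming a GitLab-style CI file and a hypothetical internal mirror URL, it's just:

    # .gitlab-ci.yml (or the equivalent env block in any CI system)
    variables:
      PIP_INDEX_URL: "https://pypi-mirror.internal.example/simple/"

    # or plain shell, exported before the build steps run
    export PIP_INDEX_URL=https://pypi-mirror.internal.example/simple/

Every pip invocation in the job picks it up without touching requirements files.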
By default the downloaded wheels are cached under ~/.cache/pip (on Linux). Isolated builds don't benefit from this if they start with a fresh filesystem on every run; consider mounting a writable directory to use as the cache, the speedup is more than worth it.
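For a containerised build that could look something like this (the image name and host path are placeholders; inside most build containers pip's cache ends up at /root/.cache/pip):

    # mount a persistent host directory over pip's cache inside the build container
    docker run --rm \
      -v /srv/ci-cache/pip:/root/.cache/pip \
      my-build-image \
      pip install -r requirements.txt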
Just so. It does take a little configuration. The system I was talking about had been band-aided since the company's two-person days, when I'm sure the builds were a fraction of the size. Good example of infra tech debt.
pip can be configured to use a local warehouse, and there are warehouses that transparently proxy to PyPI while caching any previously fetched result, e.g. https://pypi.org/project/proxypypi/
Since you control it, and it's read only from the outside, you can actually expose it even outside of your network.
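For pip itself, the proxy just needs to be set as the index, e.g. system-wide (the URL is a placeholder):

    # /etc/pip.conf (or ~/.config/pip/pip.conf)
    [global]
    index-url = https://pypi-cache.internal.example/simple/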
But indeed, it must be replicated for npm, cargo, maven, docker...
Setting up your own proxy that covers pretty much everything (maven/npm/nuget/pypi/gems/docker/etc.) is not difficult and takes only a few hours of work. I went for Sonatype Nexus, but many (most?) cases can even be covered with an nginx caching proxy.
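A rough sketch of the nginx variant for PyPI (server name, cert handling and cache sizes are placeholders; this only caches the package downloads, the simple index still comes from pypi.org unless you proxy that too):

    proxy_cache_path /var/cache/nginx/pypi levels=1:2 keys_zone=pypi:10m
                     max_size=50g inactive=30d use_temp_path=off;

    server {
        listen 443 ssl;
        server_name pypi-cache.internal.example;
        # ssl_certificate / ssl_certificate_key omitted

        location / {
            proxy_pass https://files.pythonhosted.org;
            proxy_set_header Host files.pythonhosted.org;
            proxy_ssl_server_name on;
            proxy_cache pypi;
            proxy_cache_valid 200 30d;
        }
    }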
AIUI, PyPI doesn't _actually_ cost that much to host: it's all donated/sponsored hosting (with data largely served from Fastly), so the "cost" figures I'd expect to see are really "what it would cost if we actually had to pay the normal published prices".