
We all agree on how important Debian is for the whole tech community. So why is it so difficult to scale their snapshot service? Serving static files at scale is a solved problem. Am I missing something? Couldn't a cloud provider help them out?



It's very, very expensive. I don't know the details for Debian, but pypi.org, which hosts Python packages, costs $800k/month: https://twitter.com/dstufft/status/1236331765846990848

I imagine Debian is also super expensive to host, so scaling it can't be easy. Every decision could mean thousands of dollars.


This lazy, constant pulling of dependencies by CI systems and containers is not very sustainable. PyPI should set up rate limits and make people use a cache proxy.


I worked at a place (briefly) where the CI process was pulling down 3-4GiB of container images for every run. Same repos, same sites, every check-in on every branch pushed to GitHub set up a blank environment and pulled everything down afresh. Then the build process yoinked dozens of packages, again with no cache. It must have consumed terabytes a day. Madness!


I am curious how such a cache could be set up reliably.

1. A proxy inside the company network that intercepts every request to PyPI? Doesn't work well due to HTTPS, I guess.

2. Replacing PyPI in your project configuration with your own mirror? Might work, but it at least slows down every build outside your network. It also needs to be replicated for npm, cargo, maven, docker, ...

3. Starting a userspace wrapper before the build that transparently redirects every request. That would be the best solution, IMO. But how do I intercept the HTTPS requests of child processes? Technically it must be possible, but is there such a tool?
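For the wrapper idea, here's a minimal sketch in Python, assuming a local MITM-style caching proxy (e.g. mitmproxy) already runs on 127.0.0.1:8080 and that its CA certificate lives at /etc/ci/proxy-ca.pem; both values are invented for the example. It isn't fully transparent, since it only catches programs that honour the proxy environment variables:

    # Hypothetical wrapper: run the build command with a local caching
    # proxy injected via environment variables. Proxy address and CA
    # path are made-up values for this sketch.
    import os
    import subprocess
    import sys

    env = dict(
        os.environ,
        HTTP_PROXY="http://127.0.0.1:8080",
        HTTPS_PROXY="http://127.0.0.1:8080",
        # Most HTTPS clients honour one of these for trusting extra CAs:
        REQUESTS_CA_BUNDLE="/etc/ci/proxy-ca.pem",   # requests/pip
        SSL_CERT_FILE="/etc/ci/proxy-ca.pem",        # many OpenSSL-based tools
        NODE_EXTRA_CA_CERTS="/etc/ci/proxy-ca.pem",  # node/npm
    )

    # Run the real build command (passed as arguments), e.g.:
    #   python wrapper.py make build
    sys.exit(subprocess.call(sys.argv[1:], env=env))

Tools that ignore HTTPS_PROXY would still slip past this; catching those needs something lower level, like iptables-based redirection.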


> Also needs to be replicated for npm, cargo, maven, docker, ...

All these frameworks come with a tool to run an internal repo/mirror.

There are also commercial offerings (like Artifactory) that cover all these languages in a single tool.

For Python, just set PIP_INDEX_URL in the CI environment (e.g. PIP_INDEX_URL=https://mirror.internal/simple) to point at the internal mirror and it will be used automatically. It's very easy.

By default, downloaded wheels are cached under ~/.cache/pip (on Linux). Isolated builds don't benefit from this if they start with a fresh filesystem on every run; consider mounting a writable directory to use as the cache. The speedup is more than worth it.


Just so. It does take a little configuration. The system I was talking about had been band-aided since the company's two-person days, when I'm sure the builds were a fraction of the size. Good example of infra tech debt.


For 2.:

pip can be configured to use a local warehouse, and there are warehouses that transparently proxy to PyPI but cache any previously fetched result, e.g. https://pypi.org/project/proxypypi/
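For illustration only, a toy version of that idea using just the standard library. A real proxy like proxypypi also rewrites the file URLs inside index pages so the package downloads themselves go through the cache, and handles errors and concurrency; this sketch does neither, and its cache path is made up:

    # Toy caching pass-through for an upstream package index.
    # Not production-ready; for illustrating the caching idea only.
    import hashlib
    import pathlib
    import urllib.request
    from http.server import BaseHTTPRequestHandler, HTTPServer

    UPSTREAM = "https://pypi.org"
    CACHE = pathlib.Path("/var/cache/pypi-proxy")  # hypothetical path

    class CachingHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            key = CACHE / hashlib.sha256(self.path.encode()).hexdigest()
            if not key.exists():
                # Cache miss: fetch once from upstream and keep the body.
                with urllib.request.urlopen(UPSTREAM + self.path) as resp:
                    key.write_bytes(resp.read())
            self.send_response(200)
            self.end_headers()
            self.wfile.write(key.read_bytes())

    CACHE.mkdir(parents=True, exist_ok=True)
    HTTPServer(("", 8080), CachingHandler).serve_forever()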

Since you control it, and it's read-only from the outside, you can even expose it outside your network.

But indeed, it must be replicated for npm, cargo, maven, docker...

There is a startup idea here :)


Use Nix instead. :P


Agreed. In fact, if you make more than 100 requests/minute from the same IP, you should get throttled. If you want out of that, you should pay.
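Something like a sliding-window counter per IP would do. A minimal sketch; the limit, window, and in-memory data structure are all just illustrative:

    # Per-IP throttle: at most LIMIT requests in any WINDOW-second span.
    import time
    from collections import defaultdict

    LIMIT = 100     # requests
    WINDOW = 60.0   # seconds

    _hits: dict[str, list[float]] = defaultdict(list)

    def allow(ip: str) -> bool:
        """Return False (i.e. answer 429) once an IP exceeds the limit."""
        now = time.monotonic()
        recent = [t for t in _hits[ip] if now - t < WINDOW]
        if len(recent) >= LIMIT:
            _hits[ip] = recent
            return False
        recent.append(now)
        _hits[ip] = recent
        return True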


Setting up your own proxy that covers pretty much everything (maven/npm/nuget/pypi/gems/docker/etc.) is not difficult and takes only a few hours of work. I went for Sonatype Nexus, but many (most?) cases can even be covered with an nginx caching proxy.


This didn't go so well for Docker Inc. when they tried it with Docker Hub.


Doesn't matter; the PSF is a non-profit. If people start trying to avoid the throttle by using alternatives, that just saves it money.


AIUI, PyPI doesn't _actually_ cost that much to host: the hosting is all donated/sponsored (with data largely served by Fastly), and I'd expect the quoted "cost" means "what it would cost if we actually had to pay the normal published prices".


Is that with at least some attempt at building a CDN? Cloud providers generally don't charge for traffic between hosts in the same availability zone. One could think about putting a replica into each major availability zone of the various cloud providers. The main service would then only issue HTTP redirects to the appropriate replica, or, if no replica exists or the package isn't replicated there yet, answer directly.
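A rough sketch of that redirect logic, where region_of() stands in for a GeoIP lookup and the replica URLs are invented for the example; a real service would also have to track which files each replica actually has:

    # Central service: 302 downloads to a replica in the client's
    # region when one exists, otherwise serve the file directly.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    REPLICAS = {
        "us-east-1": "https://us-east-1.mirror.example",
        "eu-west-1": "https://eu-west-1.mirror.example",
    }

    def region_of(ip):
        # Placeholder: map the client IP to the nearest cloud region.
        return None

    class Redirector(BaseHTTPRequestHandler):
        def do_GET(self):
            target = REPLICAS.get(region_of(self.client_address[0]))
            if target:
                # Replica available: redirect; traffic to it is then
                # free (or cheap) within the same availability zone.
                self.send_response(302)
                self.send_header("Location", target + self.path)
                self.end_headers()
            else:
                # No replica, or file not replicated yet: answer directly.
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"(file served from the main service)\n")

    HTTPServer(("", 8000), Redirector).serve_forever()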

Even if such a system isn't built, with that kind of money on the table you could hire a team of FAANG-scale developers to build it for you.


All the CI/CD build agents with no cache, and so on. This is a general problem across tech. For the web, caching is cheap, but as far as I know there is no equally cheap way to cache builds.

I think there needs to be a redesign of how dependencies work in most programming languages. Deterministic builds have been such a game changer, and I think the CPU-vs-bandwidth trade-off may be the next big area to explore when it comes to compiling code.


Isn't Debian opinionated about the freedom of the stack it stands on? Would the community be happy to take on a dependency on a vendor?


Debian is opinionated about software freedom, thankfully.

But it's OK to accept donations in hardware or money as long as there are no strings attached.



