PyPI in a Box (2020) (vuyisile.com)
88 points by luu on Feb 25, 2022 | 28 comments



We had mirrors on every continent for linux distributions (and other things) way back before the turn of the century! cough cough

Mildly surprised a service as big as PyPI doesn't.


I think the idea here isn't that there isn't a PyPI mirror in Africa. It's that not everyone has internet access, so this person wants a tiny computer to broadcast a local WiFi network that people can connect to and download pip packages from. Imagine a small town or village with some power and a classroom, but no internet. A teacher could set up a network and have students connect to this device and download packages so they can complete some assignment/make the next Facebook.
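Pointing pip at a box like that is a one-liner once you're on its network (the hostname and port here are made up for illustration):

    pip install flask --index-url http://pypibox.local:8080/simple/ \
        --trusted-host pypibox.local

The --trusted-host flag is needed because a local mirror typically serves plain HTTP.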


Or that the Internet access they do have is often metered. A friend tells me 100MB costs about 1 USD where he is, which is not an insignificant amount of money. Really puts the whole 300 MB electron app thing in perspective; at any rate it's understandable why having a PyPI mirror in the classroom would be preferable to having each student download the packages over and over.


> Mildly surprised a service as big as PyPI doesn't.

PyPI uses Fastly as a CDN, which does indeed have presence on every continent (except the big, cold one). The problem here isn't presence, but connectivity to the outside Internet itself.

Source: I'm an active developer on PyPI.



I remember at PyCon ~2016 one of the maintainers of PyPI gave a short talk on it, and the entire PyPI service at the time ran on only one or two boxes. It was surprisingly scrappy for such a critical service.


This has changed a lot in recent years and is now in a much better state. Their docs give a really nice overview of the evolution over time: https://warehouse.pypa.io/application.html


PyPI needs more funding. PyPI even disabled their search API for the pip CLI because of infrastructure overload. It would be nice if more sponsors stepped up to fund their infrastructure.


I think companies could choose to be more responsible with their usage. Looking at PyPI utilization, I have to imagine the bulk of it comes from CI/CD tooling hammering the servers without any intermediate caching.


That's definitely true. There are so many build jobs running in Docker containers that lack a pip cache. I believe my company has its own package servers, but I've specifically avoided adding build system features because of that lack of caching. For one of me, there are probably a dozen people that would just hit the main PyPI servers every run.
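For what it's worth, BuildKit cache mounts make this fairly painless nowadays; something like this keeps pip's download cache across builds without baking it into the image:

    # syntax=docker/dockerfile:1
    FROM python:3.10-slim
    COPY requirements.txt .
    # the cache mount persists /root/.cache/pip between builds
    RUN --mount=type=cache,target=/root/.cache/pip \
        pip install -r requirements.txt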


Definitely. I don't think hammering PyPI's API would be much of a problem if the companies doing the hammering paid for the privilege, though.


Serious question: should charities start downloading the entirety of NPM and PyPI (+ Node, NPM, Linux packages, etc.) onto hard drives and passing them on to schools teaching software development in third-world countries, alongside the computers they normally send? I'm not sure if that's even possible, or how big the HDD would have to be to store all that, but it would be very very nice nevertheless. I can't imagine not being able to install one-off software packages.


PyPI is more complex and probably serves way more traffic.


This is great work!

Python packaging is complicated for many reasons (both good and bad), but PyPI's index format is delightfully simple. Projects like this reinforce my opinion that keeping it simple has been a great decision by the Python community and PyPI admins.
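For anyone who hasn't looked under the hood: the entire per-project "API" is just an HTML page of links (PEP 503). Abridged:

    <!-- GET https://pypi.org/simple/requests/ -->
    <!DOCTYPE html>
    <html>
      <body>
        <h1>Links for requests</h1>
        <a href="https://files.pythonhosted.org/.../requests-2.27.1-py2.py3-none-any.whl#sha256=..."
           data-requires-python="&gt;=2.7, ...">requests-2.27.1-py2.py3-none-any.whl</a>
        ...
      </body>
    </html>

Mirroring is basically just crawling those links, which is why projects like this are feasible on a Raspberry Pi.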


>Python packaging is complicated for many reasons (both good and bad), but PyPI's index format is delightfully simple.

In a lot of the worst ways. Dependency resolution can't be done locally; it requires tons of separate network requests to fetch all the requirements.
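To make it concrete: a resolver has to make at least one round trip per package just to learn that package's dependencies. A deliberately naive sketch of that walk against PyPI's JSON API (real endpoint; this ignores versions, extras, and environment markers entirely):

    import json
    import re
    from urllib.request import urlopen

    def requires(project):
        # one network round trip per project, just to discover its deps
        try:
            with urlopen("https://pypi.org/pypi/%s/json" % project) as resp:
                info = json.load(resp)["info"]
        except Exception:
            return []  # e.g. 404 on an optional dep; fine for a sketch
        return info.get("requires_dist") or []

    seen, todo = set(), ["requests"]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        for dep in requires(name):
            # crudely strip version specifiers, extras, and markers
            todo.append(re.split(r"[ ;(\[<>=!~]", dep, maxsplit=1)[0])
    print(len(seen), "projects touched:", sorted(seen))

Even this toy version fans out to many requests for a single install target; a real resolver that has to consider multiple candidate versions makes far more.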


I don't have as many complaints about Python's dependency system as most people, but that issue is very painful. Calculating dependencies properly is slow and very vulnerable to network failures, and it causes constant issues inside corporate networks with crappy VPNs and unreliable connections.


Yeah, this is probably the single largest wart in the Python packaging ecosystem. It's gotten better over time (wheels improve things, as do third-party tools like Pipenv and pip-compile), but there's still a lot to do.


I don’t wish to take away from pypiserver and how nice and simple it is, but I found it’s missing certain features that are present in the PyPI Warehouse that underpins pypi.org.

If I remember correctly, it didn’t expose the python_requires metadata, or maybe it was something to do with extras that I wanted to test. It was definitely an edge case but still something to be aware of.
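For reference, PEP 503 exposes that as a data-requires-python attribute on each file link in the simple index, e.g. (abridged):

    <a href=".../numpy-1.22.2.tar.gz" data-requires-python="&gt;=3.8">numpy-1.22.2.tar.gz</a>

pip uses that attribute to skip releases that don't support the running interpreter, so an index that omits it can silently serve incompatible versions.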


devpi acts as a caching proxy for PyPI and takes a bit less setup than this. Plus, you can use it for storing your own packages in a separate index.

https://github.com/devpi/devpi
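Getting the caching proxy going is only a couple of commands. Roughly (the init step has moved between releases, devpi-server --init vs. a separate devpi-init command, so check the docs for your version):

    pip install devpi-server
    devpi-server --init    # one-time state initialization
    devpi-server           # serves on http://localhost:3141

    # point pip at the built-in pypi.org proxy index:
    pip install requests -i http://localhost:3141/root/pypi/+simple/

Anything fetched through root/pypi gets cached locally, so the second install of a package never touches the outside network.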


You can also do this with Pulp, and have it act as a caching proxy that lazily caches the packages only when they first get downloaded.

It's a lot more heavyweight though, so maybe it's not the best choice for a Raspberry Pi.


The developer experience in regions that don't have fast internet access is hard to imagine, especially with bandwidth-hogs like the npm ecosystem.

See also https://meyerweb.com/eric/thoughts/2018/08/07/securing-sites... - which points out that when every site moved to HTTPS, it broke local caching proxies, and that had a big negative impact on people in countries with slower internet.


Regarding the article you sent:

> Beyond deploying service workers and hoping those struggling to bridge the digital divide make it across, I don’t really have a solution here.

One way to get around this would be a MITM HTTPS certificate, which would allow a local caching proxy to decrypt and re-encrypt requests. This means you'd have to install the certificate on each device you want the caching proxy to work with (otherwise the proxy can't decrypt their traffic), though.


> At the time of writing this post, the entire PyPI repository is somewhere in the neighbourhood of 1TB but, by using a selective download, I was able to get it down to 120GB or so.

That's the surprising part to me; that config doesn't look like it's filtering by popularity or anything, just Python versions? I guess packages aren't huge, but I'm still surprised that it's so small.
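For context, selective mirrors like this are typically built with bandersnatch, which does its filtering through a plugin system. A rough sketch of a selective config, from memory (recent releases use allowlist/blocklist naming; 2020-era ones said whitelist/blacklist, so check the docs for your version):

    [mirror]
    directory = /srv/pypi
    master = https://pypi.org
    workers = 3

    [plugins]
    enabled =
        allowlist_project

    [allowlist]
    packages =
        flask
        numpy
        requests

Only the listed projects (and their release files) get mirrored, which is how you shrink a multi-terabyte index down to something a small disk can hold.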


Maybe PyPI was that small when the post was written. It currently sits at 10.4TB; TensorFlow alone is more than 1TB. I suppose you can still keep the mirror relatively small if you exclude most of the giant ML wheels.

https://pypi.org/stats/


The average Python package distribution is pretty small: source distributions are (or should be) compressed source code, and binary distributions ("wheels") are mostly compressed Python source plus the occasional platform-specific binary.

There are also relatively conservative per-project and per-release upload limits. PyPI's admins can increase those limits for a project or an individual release, but the baseline limits have seemingly preserved the commons reasonably well.


I bet this could be added to the Internet-in-a-Box project. It already comes with Python installed.

https://internet-in-a-box.org/


Q: where is the last version of 2.7 archived on PyPI?


You mean Python? https://www.python.org/downloads/ still has 2.7.

If you're looking for libraries that support 2.7, you're in for a lot more work: you'd have to go library by library and figure out the last version each one produced that supported 2.7. Almost certainly it will be way out of date and depend on other out-of-date things.

There was no single moment where every library 'threw the switch' and moved off 2.7 to 3.x; it happened organically over about 10 years. A lot of libraries spent a significant amount of time (and code) supporting both 2.7 and 3.x from the same codebase.
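One mitigating detail: pip 9.0+ understands the Requires-Python metadata published in the index, so under a 2.7 interpreter a reasonably recent pip will automatically resolve each library to its newest release that still declared 2.7 support:

    # run under a Python 2.7 interpreter; pip skips any release whose
    # Requires-Python metadata excludes 2.7
    python2.7 -m pip install requests

That only helps for libraries that actually set python_requires on their final 2.7-compatible releases, which many did as part of their 3.x migration.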



