PyPI in a Box (2020) (vuyisile.com)
88 points by luu on Feb 25, 2022 | 28 comments



We had mirrors on every continent for linux distributions (and other things) way back before the turn of the century! cough cough

Mildly surprised a service as big as PyPI doesn't.


I think the idea here isn't that there isn't a PyPI mirror in Africa. It's that not everyone has internet access, so this person wants a tiny computer to broadcast a local WiFi network that people can connect to and download pip packages from. Imagine a small town or village with some power and a classroom, but no internet. A teacher could set up a network and have students connect to this device and download packages so they can complete some assignment/make the next Facebook.
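Pointing pip at a box like that is a one-liner once you're on its network (the hostname and port here are made up for illustration):

    pip install flask --index-url http://pypibox.local:8080/simple/ \
        --trusted-host pypibox.local

The --trusted-host flag is needed because a local mirror typically serves plain HTTP.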


Or that the Internet access they do have is often metered. A friend tells me 100MB costs about 1 USD where he is, which is not an insignificant amount of money. Really puts the whole 300 MB electron app thing in perspective; at any rate it's understandable why having a PyPI mirror in the classroom would be preferable to having each student download the packages over and over.


> Mildly surprised a service as big as PyPI doesn't.

PyPI uses Fastly as a CDN, which does indeed have presence on every continent (except the big, cold one). The problem here isn't presence, but connectivity to the outside Internet itself.

Source: I'm an active developer on PyPI.



I remember at PyCon ~2016 one of the maintainers of PyPI gave a short talk on it, and the entire PyPI service at the time ran on only one or two boxes. It was surprisingly scrappy for such a critical service.


This has changed a lot in recent years and is now in a much better state. Their docs give a really nice overview of the evolution over time: https://warehouse.pypa.io/application.html


PyPI needs more funding. PyPI even disabled their search API for the pip CLI because of infrastructure overload. It would be nice if more sponsors stepped up to fund their infrastructure.


I think companies could choose to be more responsible with their usage. Looking at PyPI utilization, I have to imagine the bulk of it comes from CI/CD tooling hammering the servers without any intermediate caching.


That's definitely true. There are so many build jobs running in Docker containers that lack a pip cache. I believe my company has its own package servers, but I've specifically avoided adding build system features because of that lack of caching. For one of me, there are probably a dozen people that would just hit the main PyPI servers every run.
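For what it's worth, BuildKit cache mounts make this fairly painless nowadays; something like this keeps pip's download cache across builds without baking it into the image:

    # syntax=docker/dockerfile:1
    FROM python:3.10-slim
    COPY requirements.txt .
    # the cache mount persists /root/.cache/pip between builds
    RUN --mount=type=cache,target=/root/.cache/pip \
        pip install -r requirements.txt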


Definitely. I don't think hammering PyPI's API would be much of a problem if the companies doing the hammering paid for the privilege, though.


Serious question: should charities start downloading the entirety of NPM and PyPI (+ Node, NPM, Linux packages, etc.) onto hard drives and passing them on to schools teaching software development in third-world countries, alongside the computers they normally send? I'm not sure if that's even possible, or how big the HDD would have to be to store all that, but it would be very very nice nevertheless. I can't imagine not being able to install one-off software packages.


PyPI is more complex and probably serves way more traffic.


This is great work!

Python packaging is complicated for many reasons (both good and bad), but PyPI's index format is delightfully simple. Projects like this reinforce my opinion that keeping it simple has been a great decision by the Python community and PyPI admins.
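For anyone who hasn't looked under the hood: the entire per-project "API" is just an HTML page of links (PEP 503). Abridged:

    <!-- GET https://pypi.org/simple/requests/ -->
    <!DOCTYPE html>
    <html>
      <body>
        <h1>Links for requests</h1>
        <a href="https://files.pythonhosted.org/.../requests-2.27.1-py2.py3-none-any.whl#sha256=..."
           data-requires-python="&gt;=2.7, ...">requests-2.27.1-py2.py3-none-any.whl</a>
        ...
      </body>
    </html>

Mirroring is basically just crawling those links, which is why projects like this are feasible on a Raspberry Pi.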


>Python packaging is complicated for many reasons (both good and bad), but PyPI's index format is delightfully simple.

In a lot of the worst ways. Dependency resolution can't be done locally; it requires tons of separate network requests to fetch all the requirements.
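To make it concrete: a resolver has to make at least one round trip per package just to learn that package's dependencies. A deliberately naive sketch of that walk against PyPI's JSON API (real endpoint; this ignores versions, extras, and environment markers entirely):

    import json
    import re
    from urllib.request import urlopen

    def requires(project):
        # one network round trip per project, just to discover its deps
        try:
            with urlopen("https://pypi.org/pypi/%s/json" % project) as resp:
                info = json.load(resp)["info"]
        except Exception:
            return []  # e.g. 404 on an optional dep; fine for a sketch
        return info.get("requires_dist") or []

    seen, todo = set(), ["requests"]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        for dep in requires(name):
            # crudely strip version specifiers, extras, and markers
            todo.append(re.split(r"[ ;(\[<>=!~]", dep, maxsplit=1)[0])
    print(len(seen), "projects touched:", sorted(seen))

Even this toy version fans out to many requests for a single install target; a real resolver that has to consider multiple candidate versions makes far more.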


I don't have as many complaints about Python's dependency system as most people, but that issue is very painful. Calculating dependencies properly is slow and very vulnerable to network failures, and it causes constant issues inside corporate networks with crappy VPNs and unreliable connections.


Yeah, this is probably the single largest wart in the Python packaging ecosystem. It's gotten better over time (wheels improve things, as do third-party tools like Pipenv and pip-compile), but there's still a lot to do.


I don’t wish to take away from pypiserver and how nice and simple it is, but I found it’s missing certain features that are present in the PyPI Warehouse that underpins pypi.org.

If I remember correctly, it didn’t expose the python_requires metadata, or maybe it was something to do with extras that I wanted to test. It was definitely an edge case but still something to be aware of.
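For reference, PEP 503 exposes that as a data-requires-python attribute on each file link in the simple index, e.g. (abridged):

    <a href=".../numpy-1.22.2.tar.gz" data-requires-python="&gt;=3.8">numpy-1.22.2.tar.gz</a>

pip uses that attribute to skip releases that don't support the running interpreter, so an index that omits it can silently serve incompatible versions.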


devpi acts as a caching proxy for PyPI and takes a bit less setup than this. Plus, you can use it for storing your own packages in a separate index.

https://github.com/devpi/devpi
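Getting the caching proxy going is only a couple of commands. Roughly (the init step has moved between releases, devpi-server --init vs. a separate devpi-init command, so check the docs for your version):

    pip install devpi-server
    devpi-server --init    # one-time state initialization
    devpi-server           # serves on http://localhost:3141

    # point pip at the built-in pypi.org proxy index:
    pip install requests -i http://localhost:3141/root/pypi/+simple/

Anything fetched through root/pypi gets cached locally, so the second install of a package never touches the outside network.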


You can also do this with Pulp, and have it act as a caching proxy that lazily caches the packages only when they first get downloaded.

It's a lot more heavyweight though, so maybe it's not the best choice for a Raspberry Pi.


The developer experience in regions that don't have fast internet access is hard to imagine, especially with bandwidth-hogs like the npm ecosystem.

See also https://meyerweb.com/eric/thoughts/2018/08/07/securing-sites... - which points out that when every site moved to HTTPS, it broke local caching proxies, and that had a big negative impact on people in countries with slower internet.


Regarding the article you sent:

> Beyond deploying service workers and hoping those struggling to bridge the digital divide make it across, I don’t really have a solution here.

One way to get around this would be a MITM HTTPS certificate, which would allow a local caching proxy to decrypt and re-encrypt requests. This means you'd have to install the certificate on each device you want the caching proxy to work with (otherwise the proxy can't decrypt their traffic), though.


> At the time of writing this post, the entire PyPI repository is somewhere in the neighbourhood of 1TB but, by using a selective download, I was able to get it down to 120GB or so.

That's the surprising part to me; that config doesn't look like it's filtering by popularity or anything, just Python versions? I guess packages aren't huge, but I'm still surprised that it's so small.
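For context, selective mirrors like this are typically built with bandersnatch, which does its filtering through a plugin system. A rough sketch of a selective config, from memory (recent releases use allowlist/blocklist naming; 2020-era ones said whitelist/blacklist, so check the docs for your version):

    [mirror]
    directory = /srv/pypi
    master = https://pypi.org
    workers = 3

    [plugins]
    enabled =
        allowlist_project

    [allowlist]
    packages =
        flask
        numpy
        requests

Only the listed projects (and their release files) get mirrored, which is how you shrink a multi-terabyte index down to something a small disk can hold.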


Maybe PyPI was that small when the post was written. It currently sits at 10.4TB; TensorFlow alone is more than 1TB. I suppose you can still keep the mirror relatively small if you exclude most of the giant ML wheels.

https://pypi.org/stats/


The average Python package distribution is pretty small: source distributions are (or should be) compressed source code, and binary distributions ("wheels") are mostly compressed Python source plus the occasional platform-specific binary.

There are also relatively conservative per-project and per-release upload limits. PyPI's admins can increase those limits for a project or an individual release, but the baseline limits have seemingly preserved the commons reasonably well.


I bet this could be added to the Internet-in-a-Box project. It already comes with Python installed.

https://internet-in-a-box.org/


Q: where is the last version of 2.7 archived on PyPI?


You mean Python? https://www.python.org/downloads/ still has 2.7.

If you're looking for libraries that support 2.7, you're in for a lot more work: you'd have to go library by library and figure out the last version each one produced that supported 2.7. Almost certainly it will be way out of date and depend on other out-of-date things.

There was no single moment where every library 'threw the switch' and moved off 2.7 to 3.x; it happened organically over about 10 years. A lot of libraries spent a significant amount of time (and code) supporting both 2.7 and 3.x from the same codebase.
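One mitigating detail: pip 9.0+ understands the Requires-Python metadata published in the index, so under a 2.7 interpreter a reasonably recent pip will automatically resolve each library to its newest release that still declared 2.7 support:

    # run under a Python 2.7 interpreter; pip skips any release whose
    # Requires-Python metadata excludes 2.7
    python2.7 -m pip install requests

That only helps for libraries that actually set python_requires on their final 2.7-compatible releases, which many did as part of their 3.x migration.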



