Python typosquatting is about more than typos (iqt.org)
106 points by jspeed-meyers on Oct 2, 2020 | 50 comments



One thing that simply HAS to happen in order to put a stop to typosquatting and related errors: FFS, package maintainers should be forced to have the name of the package on the repository be the same as the main import name.

Quick, Python programmers, how do you find the current version of Beautiful Soup? How do you import it? How about Scikit-Learn?*

Those are two of the most important packages on the ecosystem, and they make you guess as to what the actual name is. STUPID. STUPID. STUPID.

* Answers: pip install beautifulsoup4, import bs4; pip install scikit-learn, import sklearn


I like the go way: import a url. If you want to alias the namespace however you want afterward then that's your problem. It doesn't solve every single security issue, obviously. But at the very least the source of the code is more transparent.

Second option: signed packages. A warning when a package is not digitally signed.

Finally, a package repository curated by whatever language foundation. But then we are getting close to OS wide native package management...

But at the very least namespace+package name should be the norm for every single package manager out there, instead of just a package name.

So it should be "pip install code.launchpad.net/beautifulsoup", not just "pip install beautifulsoup4", if it's not possible already. And then

    import "code.launchpad.net/beautifulsoup" as beautifulsoup
in the code.


Right, URLs never change. Remember when one of the most popular golang logging libraries renamed from github.com/Sirupsen/logrus to github.com/sirupsen/logrus and all of a sudden everything would break because you were using both via some dep vendor?

Or Java where you get 80 char of namespaces that may or may not change because the library moved to be an eclipse project.


FWIW, Go’s module system (which was invented afterwards) explicitly encodes uppercase vs lowercase in a platform independent way internally to prevent that from happening again.


> import a url

Or at least have a strong convention that all project namespaces be url-like. That's what Java did decades ago, and it remains one of the ecosystems which care the most about traceability.

Sure, the actual url may drift from the namespace, but at least you've got a much more google-able starting point than "lxml" or whatever.


> Second option: signed packages. A warning when a package is not digitally signed.

It's trivial to sign a malicious package with a signature.

The problem is checking that it's signed with the right signature.


And it doesn't help typosquatting since baeutifulsoup will have a perfectly valid signature.
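
Since a squatted name like baeutifulsoup can be validly signed, a name-similarity check is a separate line of defense. Here is a minimal sketch using the standard library's `difflib`; the `KNOWN_PACKAGES` list, the `find_lookalike` helper, and the 0.8 cutoff are all assumptions chosen for illustration, not anything pip or PyPI actually does.

```python
import difflib

# Hypothetical allowlist; a real one would come from your own dependency list.
KNOWN_PACKAGES = ["beautifulsoup4", "scikit-learn", "numpy", "requests"]


def find_lookalike(name, known=KNOWN_PACKAGES, cutoff=0.8):
    """Return the known package that `name` closely resembles, or None."""
    normalized = name.lower()
    if normalized in known:
        return None  # exact match: nothing suspicious about the name itself
    matches = difflib.get_close_matches(normalized, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    for candidate in ("baeutifulsoup", "requests", "flask"):
        print(candidate, "->", find_lookalike(candidate))
```

A check like this only catches near-misses of names you already know about, so it complements signing rather than replacing it.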


Absolutely not. URL-based imports are a security risk, availability nightmare, and fragile system built with the assumption that things are never renamed, taken down or significantly changed, or that the developer always has an internet connection.

Dependencies cannot be resolved offline, and time becomes a factor for the correctness of code (i.e. you fetch a dependency, code against it, dependency changes public API, coworker fetches your code, fetches the dependency, build is now inexplicably broken).

Not to mention DNS hijacking, BGP, caching, locality, etc. can all affect how dependencies are fetched. Further, not using a registry makes in-house mirrors a nightmare due to having to re-route DNS requests all the time.

It's an all around _terrible_ feature in my opinion and I am very careful to distance myself from any project claiming otherwise.


Sure, the go way:

    package main

    import (
        "fmt"

        "github.com/dimuska139/go-email-normalizer"
    )

    func main() {
        fmt.Println(emailnormalizer.NewNormalizer().Normalize("x+1@gmail.com"))
    }
Wait, how did I know to do "emailnormalizer.NewNormalizer()"? Well, that's the package name of the package at that url.

The package name is only the same as the last part of the URL in go by convention, and importing a package can put an arbitrary package identifier into your package scope.


FWIW, `goimports` would say "OK, technically that's valid, but I'm going to complain unless you explicitly import it as 'emailnormalizer'".

    import (
            "fmt"

            emailnormalizer "github.com/dimuska139/go-email-normalizer"
    )



and what happened to the "go-" prefix? is it silently dropped?


No. The Go source files declare "package emailnormalizer", so that is the name you use to reference it, even if the import URL is different.


Plot twist: the go way works in python.

You can pip install from a github URL, preferably with a branch rather than master.


Will URL imports actually solve the problem? Couldn’t a malicious actor simply register a similar name: e.g. `pip install code.lauchpad.net/beautifulsoup`?

I would also guess many packages would be hosted on github: `pip install popular-org.github.io/package` which could be typo-squatted by creating a similar sounding github user `pip install poplar-org.github.io/package`.


Or for that matter what happens if the original author loses control of the domain?


If you lose the domain, the old packages can’t be changed because they have a hash. New, evil packages are a problem, but that can also happen now if someone steals your password or does an email reset of your account.
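
The hash argument can be sketched in a few lines: if the digest of a release was recorded when the dependency was first fetched, later downloads can be checked against it no matter who controls the domain now. This assumes SHA-256 and a hypothetical helper named `matches_pinned_hash`; real tools store and verify these digests in their own formats.

```python
import hashlib


def matches_pinned_hash(data: bytes, pinned_sha256: str) -> bool:
    """True if the SHA-256 digest of `data` equals the pinned hex digest."""
    return hashlib.sha256(data).hexdigest() == pinned_sha256.lower()


if __name__ == "__main__":
    original = b"original release contents"  # stand-in for a downloaded archive
    pinned = hashlib.sha256(original).hexdigest()  # recorded when first vetted
    print(matches_pinned_hash(original, pinned))              # same bytes
    print(matches_pinned_hash(b"tampered contents", pinned))  # different bytes
```

As the comment says, this protects versions you have already pinned; it does nothing for a brand-new malicious release that nobody has a hash for yet.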


What we really need is a web-of-trust. I've been saying this for years. I know who I trust on GitHub. We even have our public keys there. It's perfectly possible to sign a release with an SSH key, even if GPG would be better.

It won't solve all the issues[0], but it would solve many of these obvious, unskilled attacks and I'm getting kinda sick of the blasé attitude in OSS.

[0] For example, an honest software developer can get local malware that could modify a file just before it is committed and signed.


cough Debian has enforced this for at least two decades.


Yes! And there are some steps in the Ruby community to pull in a web-of-trust as well. It's just frustrating that it didn't happen from the get-go in language after language.

That said, I applaud the Debian folks for putting in the work early. Not just in the web-of-trust, but in many aspects of software security.


You mean like crev? https://github.com/crev-dev/crev/

Basically a web of code reviews. There is a pip integration in the early stages per the README. I came across crev through the cargo/rust integration.


It's not just Python that has this problem. In Debian the zlib package is called zlib1g. I'm not sure what the '1g' is for, but what really puzzles me is why Debian's package for the Rust bindings is called librust-libz-sys-dev with the description "libz library (also known as zlib)"

I guess it's too late to fix this apparent typo? The project is called zlib, ''libz'' is simply incorrect.


The "1g" in "zlib1g" is a version. Dynamic libraries can have multiple versions which can be installed in parallel. zlib hasn't changed much lately, so there's only really one version that matters, but this does come into play for other libraries. For example, LibUSB has libusb-0.1 and libusb-1.0, both of which are widely used.

librust-libz-sys-dev is a Debian package for a Rust library that's literally called "libz-sys" [1]. Why the developers of this package decided to call it this and not "zlib" is a question for them, but Debian packages libraries based on the language's own name for the package, not its own reinterpretation of those names.

[1]: https://crates.io/crates/libz-sys



The mingw package (libz-mingw-w64) also has it backwards. As far as I can tell, that upstream gets it right ('mingw-w64-x86_64-zlib' I believe.)

Also, putting version number suffixes on package names when the package system already has a concept of version numbers never sat right with me. It seems like a hack around tooling deficiencies.


The Debian packaging system only allows one version of a package to be installed at a time, as it assumes that a newer version of a package will contain substantially the same files as a previous version. Versioned libraries install different files, and can be installed alongside each other.


> ''libz'' is simply incorrect.

Well, you'd better tell the zlib maintainers to change `make install` to stop installing it as `/usr/local/lib/libz.so.1` by default, if "libz" is simply incorrect.

Less facetiously: the way Debian names library packages is that if it installs "{LIBDIR}/lib{NAME}.so.{SOVER}", then the package name is "lib{NAME}g{SOVER}", regardless of what the "project name" is.


I believe libz is the name of the library file for zlib, and libz-sys is the rust library's name


It's not a typo.

The conventional name of a native code library (C, C++, etc) in Debian is derived from the SONAME of the library it installs. So you can be pretty sure that libfoo.so.3 is shipped in a binary package libfoo3.

So what about the weird exceptions, like zlib1g?

Well the package name deviates from the standard in two ways.

Firstly, you'd think it should be called 'libz1'. Given the age of zlib (the first entry in /usr/share/doc/zlib1g/changelog.Debian.gz dates back to September 1996!), I think that the naming of the binary package predates the adoption of the convention that would have resulted in it being named 'libz1'.

Second, what is this 'g' suffix? This dates back to the Glibc transition.

You see, a long time ago, Linux distributions used a fork of Glibc, known as 'Linux libc'. See https://manpages.debian.org/libc.7 for details. Long story short, it was decided to abandon the fork and adopt the GNU C Library (Glibc).

For many end-user programs, a recompile was all that was needed to effect this transition. Before switching, the binary package of such a program would depend on libc5; afterwards it would depend on libc6. Simple.

But many programs depend on shared libraries. In this case, the program could not be rebuilt against Glibc until all its depended-upon libraries had also been rebuilt against Glibc. Effectively, libfoo-built-with-Glibc was an ABI change from libfoo-built-with-Linux-libc.

In order to prevent crashes and other malfunctions from linking libc.so.5 and libc.so.6 into the same program at runtime, the decision was made to rename the binary packages libfoo3 to libfoo3g (the g being short for Glibc). Once upstream bumped its SONAME to libfoo.so.4, libfoo3g would be replaced by libfoo4, which has always been linked against Glibc and so the 'g' suffix could be dropped.
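
The naming rule and the transition suffix described so far can be sketched as a small function. This is a simplification under stated assumptions: the helper name `debian_library_package` is invented here, it only handles SONAMEs with a single numeric version, and the hyphen inserted when the library name ends in a digit is my understanding of the convention rather than something stated in this thread.

```python
import re


def debian_library_package(soname: str, abi_suffix: str = "") -> str:
    """Derive a Debian-style binary package name from a SONAME like 'libfoo.so.3'."""
    match = re.fullmatch(r"(lib[\w+.-]+?)\.so\.(\d+)", soname)
    if match is None:
        raise ValueError(f"unrecognized SONAME: {soname!r}")
    name, soversion = match.groups()
    name = name.lower().replace("_", "-")
    # Assumption: a hyphen separates name and version when the name ends in a digit.
    separator = "-" if name[-1].isdigit() else ""
    return f"{name}{separator}{soversion}{abi_suffix}"


if __name__ == "__main__":
    print(debian_library_package("libfoo.so.3"))                  # plain convention
    print(debian_library_package("libfoo.so.3", abi_suffix="g"))  # Glibc transition
    print(debian_library_package("libz.so.1"))                    # what zlib "should" be
```

Note that for `libz.so.1` this yields `libz1`, which is exactly the name the real package does not have; `zlib1g` is the historical exception discussed here.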

(Aside: I don't know the reason why the SONAME of libraries was not changed at the same time, i.e. change from libfoo.so.3 to libfoo.so.3g... perhaps because co-ordinating that rename across all Linux distributions was too big a job for too small a gain?)

If you look at the dependencies for the zlib1g package, you will notice 'Conflicts: zlib1 (<= 1:1.0.4-7)'. This is there to prevent the zlib1g package being installed on the same system as a package 'bar' which still depends on zlib1 (because 'bar' hasn't been rebuilt against glibc yet...)

If we refer back to the changelog.Debian file, we find...

    zlib (1:1.0.4-7.1) unstable; urgency=low
    
      * Updated for libc6
      * Compiled with -D_REENTRANT.
      * Non mantainer release.
    
     -- Enrique Zanardi <ezanardi@molec1.dfis.ull.es>  Wed, 17 Sep 1997 01:28:05 +0100
Which tells us that 1:1.0.4-7.1 was the version where the binary package was renamed to zlib1g. Ok, the changelog message is a little terse, but the community was a lot smaller in those days and who could call themselves a serious user of Linux without being aware of the Glibc transition? ;)

Over the years, similar transitions have taken place. The 'c102' transition occurred when GCC 3.2 broke the ABI for all C++ code. 'libbar3' became 'libbar3c102' when libbar was changed to being built with GCC 3.2, and the 'c102' suffix was similarly dropped once libbar upstream bumped its SONAME to libbar.so.4, which was packaged as 'libbar4'. It looks like libfam0c102 is the sole remaining package with this naming convention in the archive, all others having dropped the suffix long ago. You can read about the GCC transition plan at https://lists.debian.org/debian-devel-announce/2003/01/msg00... The Glibc transition occurred so long ago that I wasn't able to find any similar documents explaining it (admittedly after only brief Google searches), although https://lists.debian.org/debian-gcc/2002/08/msg00091.html (also about the c102 transition) refers to the Glibc transition:

> This is similar in spirit to the glibc transition adding `g' to the end of libraries.

TL;DR: the package is called zlib1g because it was called 'zlib' before the standard Debian naming scheme for libraries was adopted; and because it was renamed to 'zlib1g' when it was transitioned from building against Linux libc to Glibc, and since then it has remained ABI stable. If zlib upstream ever bumps the SONAME to libz.so.2 then zlib1g will be replaced by libz2.

There are a couple of other packages in the archive which haven't had a SONAME bump since the glibc transition: libcanna1g and libpam0g. Their changelogs confirm that they are named so due to the glibc transition.

(There's one other package, libgjs0g that is a red herring. It looks like upstream broke ABI without bumping the SONAME for whatever reason, leaving it to distributions to pick up the pieces. Typical. The way this was solved in Debian was to rename the binary package from libgjs0 -> libgjs0a -> libgjs0b -> ... -> libgjs0e -> libgjs0g and since then it's remained stable.)


Sometimes this is a feature though. PIL is abandoned but was widely used, so its replacement pillow can use the same import as it is a drop-in replacement.


Wouldn't it be better to have some explicit method of aliasing your dependencies if this is actually something you want to do? It seems wrong for a feature like that to be so opaque.


It’d probably be better, but it is an accidental feature of the current system.


Agreed. I maintain a fork of a package that loads plugins; and these plugins import from the original package's name. If I couldn't use the same name, users would need to edit every old plugin's code.


> Quick, Python programmers, how do you find the current version of Beautiful Soup? How do you import it? How about Scikit-Learn?

> Those are two of the most important packages on the ecosystem, and they make you guess as to what the actual name is. STUPID. STUPID. STUPID.

Well, my approach is to go to the project website and look for installation instructions, because searching pip is completely useless.


What's especially odd is python would let you do `import beautifulsoup4 as bs4`, so there's no need to pick `bs4` for the module name. (For `scikit-learn`, I'd say restricting package names to valid python module names is fine.) Imagine if `numpy` had been named `np` or `pandas` `pd`.


Considering I have written import numpy as np probably thousands of times I would be quite okay with import np.


There is also nothing preventing you from installing multiple importable modules, or module names already used by somebody else, etc. It is really a behavior that most people would describe as "broken" if there wasn't a need to keep supporting the legacy of existing packages.


There is another category - mistyped commands that are not misspelled.

For example, there is a package named “install”. Typing “pip install install” installs it. If it were malicious, a user might type “pip install ”, then copy a complete command from the web that also contains “pip install”. That would result in “pip install pip install <...>”, which effectively prepends “pip” and “install” - both of which are packages - to the list of packages to be installed.

I once tried to claim “requirements.txt” for this reason, but it’s (thankfully, and reasonably) blacklisted. A user mistyping “pip install requirements.txt” (omitting the “-e” flag) would have otherwise installed it. I was planning on writing a small script that alerted the user and offered to install the packages they really wanted if they approved.
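
To make the mechanics concrete, here is a deliberately simplified model of how a `pip install` line splits into package names. It is not pip's real argument parser: `requested_names` is a hypothetical helper, and `OPTIONS_WITH_VALUE` covers only a few of pip's options that take a following argument.

```python
import shlex

# Options that consume the following token; a deliberately incomplete list.
OPTIONS_WITH_VALUE = {"-r", "--requirement", "-e", "--editable", "-c", "--constraint"}


def requested_names(command: str) -> list[str]:
    """Return the tokens a 'pip install ...' line would treat as package names."""
    tokens = shlex.split(command)
    if tokens[:2] != ["pip", "install"]:
        raise ValueError("expected a command starting with 'pip install'")
    names = []
    rest = iter(tokens[2:])
    for token in rest:
        if token in OPTIONS_WITH_VALUE:
            next(rest, None)  # skip the option's argument
        elif not token.startswith("-"):
            names.append(token)
    return names


if __name__ == "__main__":
    print(requested_names("pip install pip install requests"))
    print(requested_names("pip install requirements.txt"))
    print(requested_names("pip install -r requirements.txt"))
```

In this model the doubled command yields `pip`, `install` and `requests` as three separate names, and leaving out the flag turns the requirements file name itself into a package name.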


Chinese-speaking security researcher fate0 found a way around the blacklisting of requirements.txt in 2017. There is a write-up in English of it here (second half of the chapter): https://haukeluebbers.de/blog/2020-01-timeline-of-package-de...

I agree, it is a different category of attack on the human side of the package manager installation process.


> (omitting the “-e” flag)

You mean "-r"?


Probably.

Python is my favorite language by far, but I’ve not worked in it daily for many months.


-r pulls requirements from a file, line by line.

-e means editable, it's mostly used for source installs of libraries you want to work on while installed in a project.


My mnemonic for the "-r" flag is to just remember that k8s gets it right:

"kubectl apply -f myfile" ("apply file ...")

and pip stacks two verbs together in a silly way:

"pip install -r myfile" ("install read ..."??)


I was surprised to learn recently that Java's main repository Maven Central apparently has a manual review for new packages where they validate that you own the domain of your project or the github for it. I'm not sure to what extent, but that seems very reasonable.

Edit: Here's the link https://central.sonatype.org/articles/2014/Feb/27/why-the-wa...


They do. And it does take a day or a couple of days. But somehow it's a high enough barrier to entry that the NPM and Go style of publishing packages is taking over the world...


Happy to see this brought up. I consider myself to be a bit security minded and was astounded by the trash heap of similarly titled packages available while configuring a new install of pycharm. There is a serious glut of packages with names that appear to intentionally attract installations by mistake. I wound up probing each package I needed prior to install, at one point landing in a random GitHub repo set to private. ... idk ... Use protection folks!


This is going to seriously burn someone at one point. But right now, I think our package repositories are simply too convenient.

Maybe someone will aggressively squat a load of names and upload a package that scrolls a million lines of "YOUR REPOSITORY IS INSECURE!!1". Maybe that will make devs more security conscious.

First one to shake up the world by massively exploiting this gets to decide how bad it's going to be. "rm -rf /" would be fun too, or exfiltrating everyone's .ssh folder.


I had a related issue with dateutil package: https://github.com/dateutil/dateutil/issues/327 Unfortunately they don't care that much.


I'm curious what sort of malicious code was distributed this way.


I hate this about python.

Lost an hour to typos in imports...



