Alex Becker
Posted on April 30, 2019
Most programming languages' package ecosystems have two levels: each package has one or more releases, which are distinguished by version. Python has a third: each release has one or more distributions, which are the actual files you download to install a package. In most languages that file is synonymous with the release, but the "or more" is crucial in Python, because for most releases of most widely used packages, there is in fact more than one distribution.
Why? Well, Python is special in that it treats C extensions as a first-class feature of the language, and tries to insulate package users from having to compile C extensions. This means that distributions need to contain binary code compiled from the C extensions—such distributions (in their modern iteration) are called binary wheels. But C extensions usually need to be compiled for a specific target Python version and operating system, so to have any sort of wide support you need multiple wheels. Furthermore, since the package author can't anticipate all Python versions and operating systems (some of which don't exist yet!), it's also important to include a source distribution, which the package user is responsible for compiling.
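To make the "specific target Python version and operating system" concrete: a wheel's filename carries that information as tags, following the standard wheel naming convention (name, version, optional build tag, Python tag, ABI tag, platform tag). The sketch below splits a filename into those parts. The helper name `parse_wheel_filename` and the example filename are illustrative assumptions, and this is a simplified reading of the convention rather than a replacement for a maintained library.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    name: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelName:
    """Split a wheel filename into its dash-separated components.

    Hypothetical helper for illustration only: it assumes a well-formed
    filename and does no validation of the individual tags.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelName(name, version, build, python_tag, abi_tag, platform_tag)


# Example filename chosen for illustration.
info = parse_wheel_filename("numpy-1.16.3-cp37-cp37m-manylinux1_x86_64.whl")
print(info.python_tag, info.abi_tag, info.platform_tag)
```

A source distribution, by contrast, carries no such tags in its filename, which is why it can serve as the fallback for any platform that has a working compiler.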
Despite this, users (and most tools) still think in terms of release versions rather than specific distributions. This can lead to surprising inconsistencies. For example, installing a package might take seconds on one machine (because there is a matching binary distribution) and minutes or even hours on another (because it has to compile the source distribution). Even if both machines find appropriate binary distributions to install, they may pick different files whose hashes won't match, making it more difficult to detect man-in-the-middle (MitM) attacks. This is because tools like pip automatically determine the "most suitable" distribution for a release, preferring a binary wheel when one is compatible with the given system (and the most specific binary wheel if several are) and otherwise falling back to the source distribution. Most other tools follow suit, if only by virtue of using pip under the hood.
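The selection rule described above can be sketched roughly as follows. This is a simplified model and not pip's actual implementation: it assumes the installer has a list of supported (Python, ABI, platform) tags ordered from most to least specific, and the names `wheel_tags` and `choose_distribution` are hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Tag = Tuple[str, str, str]  # (python tag, abi tag, platform tag)


def wheel_tags(filename: str) -> List[Tag]:
    """Expand the tags in a wheel filename, e.g. 'py2.py3-none-any'."""
    py, abi, plat = filename[: -len(".whl")].split("-")[-3:]
    return list(product(py.split("."), abi.split("."), plat.split(".")))


def choose_distribution(
    filenames: Sequence[str], supported: Sequence[Tag]
) -> Optional[str]:
    """Pick the most specific compatible wheel, else the source distribution.

    `supported` is assumed to be ordered from most to least specific.
    """
    rank = {}
    for i, tag in enumerate(supported):
        rank.setdefault(tag, i)

    best, best_rank, sdist = None, len(supported), None
    for fn in filenames:
        if fn.endswith(".whl"):
            ranks = [rank[t] for t in wheel_tags(fn) if t in rank]
            if ranks and min(ranks) < best_rank:
                best, best_rank = fn, min(ranks)
        elif fn.endswith((".tar.gz", ".zip")):
            sdist = fn  # fallback that the user must build locally
    return best if best is not None else sdist


files = [
    "demo-1.0.tar.gz",
    "demo-1.0-py3-none-any.whl",
    "demo-1.0-cp37-cp37m-manylinux1_x86_64.whl",
]
linux_cp37 = [("cp37", "cp37m", "manylinux1_x86_64"), ("py3", "none", "any")]
print(choose_distribution(files, linux_cp37))
```

Two machines with different supported-tag lists can therefore resolve the same release to different files, which is where the mismatched hashes and install times come from.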
The biggest problems crop up when a new distribution is published after you've already installed another distribution for that release. This is all but inevitable: PyPI only lets you upload distributions one at a time, creating a new release with the first upload that has a new version, so eventually someone is bound to download that first distribution before you upload the last one. It's made much more common by the practice of having buildbots build different distributions in parallel, with binary distributions generally taking significantly longer than source distributions. But even worse is when a package author goes back to add support for a new platform, or a new version of Python, for a release months or years after the fact. When this happens:
- Build systems which expect a certain hash for a given package suddenly break.
- PyPI mirrors like PyDist don't know to look for the new distribution and get out of sync.
- Systems which have previously installed a distribution for the release (say, your dev machine) won't get the new distribution, and might behave differently from systems that install it (say, your production servers).
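One way to notice that a release has gained distributions since you last looked is to record the set of files and hashes you saw, then compare later. The sketch below assumes PyPI's JSON API at `https://pypi.org/pypi/<name>/<version>/json`, whose `urls` entries include a `filename` and a `digests` mapping. The function names are hypothetical, and this is one possible approach rather than an established tool.

```python
import json
from typing import Dict, List
from urllib.request import urlopen


def fetch_distribution_hashes(name: str, version: str) -> Dict[str, str]:
    """Return {filename: sha256} for every distribution of one release.

    Assumes the PyPI JSON API response layout described above.
    """
    url = f"https://pypi.org/pypi/{name}/{version}/json"
    with urlopen(url) as resp:
        data = json.load(resp)
    return {f["filename"]: f["digests"]["sha256"] for f in data["urls"]}


def diff_distributions(
    recorded: Dict[str, str], current: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare a previously recorded snapshot with the current one."""
    shared = set(recorded) & set(current)
    return {
        "added": sorted(set(current) - set(recorded)),
        "removed": sorted(set(recorded) - set(current)),
        "changed": sorted(fn for fn in shared if recorded[fn] != current[fn]),
    }


# Made-up snapshot data for illustration; the hashes are placeholders.
recorded = {"demo-1.0.tar.gz": "aaa"}
current = {
    "demo-1.0.tar.gz": "aaa",
    "demo-1.0-cp38-cp38-manylinux1_x86_64.whl": "bbb",
}
print(diff_distributions(recorded, current))
```

A non-empty "added" list for a release you already depend on is the signal that a fresh install may now select a different file than the one you tested against.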
There is no obvious way to fix these issues without significantly disrupting the ecosystem, although the PyPI maintainers are aware of the pain and are discussing tool improvements. In the meantime, it is incumbent on heavy Python users and system administrators to understand how Python packages are distributed and how pip chooses distributions.