How "pip install" Works

alexbecker

Alex Becker

Posted on March 5, 2019

What happens when you run pip install <somepackage>? A lot more than you might think. Python's package ecosystem is quite complex.

First, pip needs to decide which distribution of the package to install. This is more complex for Python than for many other languages, since each version (or release) of a Python package usually has multiple distributions. There are 7 different kinds of distributions, but the most common these days are source distributions and binary wheels. A source distribution is exactly what it says on the tin: the raw Python and potentially C extension code that the package developers wrote. A binary wheel is a more complex archive format, which can contain compiled C extension code. This is convenient for users, because compiling, say, numpy from source takes a long time (~4 minutes on my desktop), and it is hard for package authors to ensure that their source code will compile on other people's machines. But it comes at a price: the compiled code is specific to the architecture and often the OS it was compiled on, so most packages with C extensions will build multiple wheel distributions, and pip needs to decide which, if any, are suitable for your computer.

To find the distributions available, pip requests https://pypi.org/simple/<somepackage>, which is a simple HTML page full of links, where the text of the link is the filename of the distribution. The filenames encode the version, kind of distribution, and for binary wheels, the architecture and OS they are compatible with. This format is complex enough to be covered by two different PEPs:

  • The version scheme is covered by PEP 440.
  • Binary wheel filename compatibility tags are the subject of PEP 425.
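To make the "simple HTML page full of links" concrete, here is a minimal sketch that pulls the link text (the distribution filenames) out of such a page using only the standard library. The page content, the `demo` package, and the `example.invalid` URLs are made up for illustration; this is not pip's own parsing code.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect the text of every <a> element, i.e. the distribution filenames."""

    def __init__(self):
        super().__init__()
        self.filenames = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.filenames.append("")

    def handle_data(self, data):
        # Only text that sits inside an <a>...</a> pair is a filename.
        if self._in_link:
            self.filenames[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False


def list_distribution_filenames(html):
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.filenames]


# Hypothetical page content, shaped like a simple index page.
SAMPLE_PAGE = """
<html><body>
<a href="https://example.invalid/demo-1.0.tar.gz">demo-1.0.tar.gz</a><br/>
<a href="https://example.invalid/demo-1.0-py2.py3-none-any.whl">demo-1.0-py2.py3-none-any.whl</a><br/>
</body></html>
"""

print(list_distribution_filenames(SAMPLE_PAGE))
```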

To select a distribution, pip first determines which distributions are compatible with your system and implementation of Python. For binary wheels, it parses the filenames according to PEP 425, extracting the Python implementation, application binary interface, and platform. The Python implementation can be something as broad as py2.py3 (meaning "any implementation of Python 2.X or 3.X") or it can specify a Python interpreter and major version, such as pp35 (meaning PyPy version 3.5). The application binary interface is essentially what version of CPython's C-API the C extension code is compatible with, if there is any. Interpreting the platform portion of the compatibility tag is more difficult. It can be relatively obvious, like win32 for 32-bit Windows, but I am usually installing manylinux1 wheels. Which Linux distributions are compatible with manylinux1 is a subject of heavy debate on the distutils mailing list. Luckily, the process for source distributions is simpler: all source distributions are assumed to be compatible, at least at this step in the process.
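As a rough illustration of that parsing step, the sketch below splits a wheel filename into its parts. It assumes the filename layout from the wheel specification (PEP 427), which is name, version, an optional build tag, then the three PEP 425 tags. The `WheelTags` type and `parse_wheel_filename` function are names I made up for this example, and real tools do more validation than this.

```python
from typing import List, NamedTuple, Optional


class WheelTags(NamedTuple):
    name: str
    version: str
    build: Optional[str]
    python: List[str]    # e.g. ["py2", "py3"] or ["cp37"]
    abi: List[str]       # e.g. ["none"] or ["cp37m"]
    platform: List[str]  # e.g. ["any"] or ["manylinux1_x86_64"]


def parse_wheel_filename(filename):
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected number of components: {filename!r}")
    # Each tag may be a "."-separated set of alternatives, such as py2.py3.
    return WheelTags(
        name, version, build,
        python.split("."), abi.split("."), platform.split("."),
    )


print(parse_wheel_filename("demo-1.0-py2.py3-none-any.whl"))
print(parse_wheel_filename("numpy-1.16.2-cp37-cp37m-manylinux1_x86_64.whl"))
```

Deciding whether those tags actually match the running interpreter and OS is the hard part, and it is not attempted here.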

Once pip has a list of compatible distributions, it sorts them by version, chooses the most recent version, and then chooses the "best" distribution for that version. It prefers binary wheels if there are any, and if there are multiple it chooses the one most specific to the install environment. These are just pip's default preferences, though; they can be configured with options like --no-binary or --prefer-binary. The "best" distribution is either downloaded or installed from the local cache, which on Linux is usually located in ~/.cache/pip.
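The default preference can be sketched in a few lines. This is a simplification under loud assumptions: versions are plain dotted integers rather than full PEP 440 versions, the `Candidate` type is invented for the example, and the "most specific wheel" tie-break between several wheels is left out.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    version: str
    is_wheel: bool
    filename: str


def version_key(version):
    # Assumption: dotted integers only ("1.10"), not the full PEP 440 scheme.
    return tuple(int(part) for part in version.split("."))


def choose_distribution(candidates):
    """Pick the newest version, then prefer a wheel over a source distribution."""
    newest = max(version_key(c.version) for c in candidates)
    same_version = [c for c in candidates if version_key(c.version) == newest]
    # True sorts above False, so a wheel wins when one exists.
    return max(same_version, key=lambda c: c.is_wheel)


candidates = [
    Candidate("1.9", True, "demo-1.9-py3-none-any.whl"),
    Candidate("1.10", False, "demo-1.10.tar.gz"),
    Candidate("1.10", True, "demo-1.10-py3-none-any.whl"),
]
print(choose_distribution(candidates).filename)
```

Note that comparing versions as tuples of integers is what makes 1.10 newer than 1.9; comparing the raw strings would get this wrong.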

Determining the dependencies for this distribution is not simple either. In theory, one could just use the requires_dist value from https://pypi.org/pypi/<somepackage>/<version>/json. However, this relies on the package author uploading the correct metadata, and older packaging clients do not do so. So in practice pip (and anyone else who wants to know the dependencies of a package) has to download and inspect it.
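For the "in theory" route, reading requires_dist out of the JSON response looks roughly like this. The sketch works on a hand-written sample document instead of making a network request, and the sample values are invented. The field can be missing or null when the metadata was never uploaded, which is exactly the weakness described above.

```python
import json


def metadata_url(package, version):
    return f"https://pypi.org/pypi/{package}/{version}/json"


def requires_dist_from_json(document):
    """Return the declared dependencies, or an empty list if none were uploaded."""
    info = json.loads(document).get("info", {})
    return info.get("requires_dist") or []


# Hypothetical response body for illustration only.
SAMPLE_RESPONSE = json.dumps(
    {"info": {"name": "demo", "version": "1.0", "requires_dist": ["idna (<=2.4)"]}}
)
MISSING_METADATA = json.dumps(
    {"info": {"name": "old-demo", "version": "0.1", "requires_dist": None}}
)

print(metadata_url("demo", "1.0"))
print(requires_dist_from_json(SAMPLE_RESPONSE))
print(requires_dist_from_json(MISSING_METADATA))
```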

For binary wheels, the dependencies are listed in a file called METADATA. But for source distributions the dependencies are effectively whatever gets installed when you execute their setup.py script with the install command. There's no way to know unless you try it, which is what pip does! Specifically, it leverages setuptools to run install up to the point where it knows what dependencies to install. However, this can be further complicated by the fact that running install might itself require dependencies. The standard way to specify this in a Python package is to pass the setup_requires argument to setuptools.setup. By way of setuptools, pip will run setup.py just enough to discover setup_requires, install those dependencies, then go back and execute setup.py again. Naturally, this is madness and setup_requires should never be used.
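The wheel case is easy to demonstrate, because a wheel is a zip archive and METADATA uses email-style headers. The sketch below assumes the file sits in a `.dist-info` directory and that dependencies appear as Requires-Dist headers. The tiny in-memory "wheel" it builds is a stand-in for a real download and contains nothing but that one file.

```python
import io
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_file):
    """Read the Requires-Dist headers from the METADATA file inside a wheel."""
    with zipfile.ZipFile(wheel_file) as archive:
        metadata_names = [
            name for name in archive.namelist()
            if name.endswith(".dist-info/METADATA")
        ]
        if not metadata_names:
            raise ValueError("no METADATA file found in archive")
        text = archive.read(metadata_names[0]).decode("utf-8")
    return Parser().parsestr(text).get_all("Requires-Dist") or []


def build_demo_wheel():
    # Hypothetical archive with invented metadata, for illustration only.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\n"
            "Name: demo\n"
            "Version: 1.0\n"
            "Requires-Dist: idna (<=2.4)\n"
            "Requires-Dist: requests\n",
        )
    buffer.seek(0)
    return buffer


print(wheel_requirements(build_demo_wheel()))
```

There is no equivalent sketch for source distributions, since finding their dependencies means running their setup.py.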

Once pip has a list of requirements, it starts this whole process over again for each required package, taking into account any constraints on its version. It builds a whole tree of packages this way, until every dependency of every distribution it has found is already in the tree. This process breaks, of course, if there is a dependency cycle, but it will always terminate. After all, there are only finitely many Python packages!
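The "until every dependency is already in the tree" stopping rule can be sketched as a plain graph walk. The index below is a made-up dictionary with a deliberate cycle, and `requirements_of` stands in for the whole download-and-inspect step. This illustrates the stopping rule only and is not a reproduction of pip's code.

```python
from collections import deque


def collect_dependencies(root, requirements_of):
    """Walk the requirements until nothing new turns up."""
    seen = {root}
    order = [root]
    queue = deque([root])
    while queue:
        package = queue.popleft()
        for dependency in requirements_of(package):
            # Skipping packages already in the tree is what guarantees
            # termination, even when the requirements form a cycle.
            if dependency not in seen:
                seen.add(dependency)
                order.append(dependency)
                queue.append(dependency)
    return order


# Hypothetical index; "urllib3" pointing back at "app" creates a cycle.
HYPOTHETICAL_INDEX = {
    "app": ["requests"],
    "requests": ["idna", "urllib3"],
    "idna": [],
    "urllib3": ["app"],
}

print(collect_dependencies("app", HYPOTHETICAL_INDEX.__getitem__))
```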

What happens, though, if one of the distributions pip finds violates the requirements of another, for example if pip first finds idna version 2.5 but then finds a distribution requiring idna<=2.4? Well, it ignores the requirement and installs idna anyway! There is a longstanding issue open to add a true dependency resolver to pip, with lots of false starts and partial implementations, but none have ever quite made it in. This is of course in large part due to the complexity of determining the dependencies for a Python package: it is very difficult to build an efficient dependency resolver when determining the dependencies of a single candidate requires downloading and executing potentially megabytes of code!
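To see what gets ignored, here is a small sketch that checks already-chosen versions against requirements discovered later and reports the ones that are violated. It assumes single-clause constraints such as `<=2.4` and dotted-integer versions; `satisfies` and `find_conflicts` are invented helper names, not pip functions.

```python
import operator

# Two-character operators come first so "<=" is not mistaken for "<".
OPERATORS = [
    ("<=", operator.le),
    (">=", operator.ge),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
]


def version_key(version):
    # Assumption: dotted integers only, not the full PEP 440 scheme.
    return tuple(int(part) for part in version.split("."))


def satisfies(version, constraint):
    for symbol, compare in OPERATORS:
        if constraint.startswith(symbol):
            return compare(version_key(version), version_key(constraint[len(symbol):]))
    raise ValueError(f"unsupported constraint: {constraint!r}")


def find_conflicts(chosen, later_requirements):
    """Return the (name, constraint) pairs that the chosen versions violate."""
    return [
        (name, constraint)
        for name, constraint in later_requirements
        if name in chosen and not satisfies(chosen[name], constraint)
    ]


chosen = {"idna": "2.5"}  # the version found first
print(find_conflicts(chosen, [("idna", "<=2.4")]))
```

A true resolver would react to such a conflict by backtracking and trying another version; the behavior described above simply keeps the first choice.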

Next, pip has to actually build and install the package. If it downloaded a source distribution, and the wheel package is installed, it will first build a binary wheel specifically for your machine out of the source. Then it needs to determine which library directory to install the package in: the system's, the user's, or a virtualenv's? This is controlled by sys.prefix, which in turn is controlled by pip's executable path and the PYTHONPATH and PYTHONHOME environment variables. Finally, it moves the wheel files into the appropriate library directory, and compiles the Python source files into bytecode for faster execution.
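You can inspect these values for your own interpreter with the standard library. The sketch below reports sys.prefix, makes a best-effort guess at whether a virtualenv is active (older virtualenv versions set sys.real_prefix, the venv module sets sys.base_prefix), and looks up the pure-Python library directory via sysconfig. It also byte-compiles a throwaway module to show the last step. The function names are mine, and pip's actual location logic has more cases than this.

```python
import os
import py_compile
import sys
import sysconfig
import tempfile


def install_location():
    in_virtualenv = (
        hasattr(sys, "real_prefix")
        or sys.prefix != getattr(sys, "base_prefix", sys.prefix)
    )
    return {
        "prefix": sys.prefix,
        "in_virtualenv": in_virtualenv,
        # Directory where pure-Python packages go for this interpreter.
        "purelib": sysconfig.get_paths()["purelib"],
    }


def compile_example():
    """Byte-compile a throwaway module and report whether bytecode was written."""
    with tempfile.TemporaryDirectory() as directory:
        source_path = os.path.join(directory, "demo_module.py")
        with open(source_path, "w") as source:
            source.write("ANSWER = 42\n")
        bytecode_path = py_compile.compile(source_path, doraise=True)
        return os.path.exists(bytecode_path)


print(install_location())
print(compile_example())
```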

Now your package is installed! I've really only scratched the surface. There are dozens of options that change pip's behavior, many corner cases of other distribution types and platform limitations, and I didn't even touch on installing multiple packages (which is handled differently than a package with multiple dependencies). But I hope this was informative, if not useful.
