How to get information about the provenance of Python packages installed

fridex

Fridolín Pokorný

Posted on April 13, 2023

Let's take a look at how to obtain information about the provenance of installed packages in the Python ecosystem. This idea is part of PEP-710, which is in a draft state as of today.

Židlochovice - Rozhledna Akátová věž; Czech republic. Image by author.


The tutorial uses files that are available at github.com/fridex/pip-provenance.

Let's create a simple Python application using Chainguard's Python image. This application will be a simple Flask "hello world" application. The app.py script will have the following content:

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'Hello, world!'

# Start the development server only when the file is run as a script
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Additionally, we will create a requirements.in file with the following content:

flask

We will use pip-tools to lock dependencies to specific versions for reproducibility. Also, we will keep hashes of the Python distributions installed:

pip-compile --generate-hashes

The command above will create a requirements.txt file. An example of such a file can be found here.

Next, let's create a containerized environment with our application.

Using the upstream pip

First, we will use the upstream pip which is also shipped in Chainguard's images. We can directly take the Dockerfile as written by Chainguard with minimal changes to make sure we have a containerized application:

FROM cgr.dev/chainguard/python:latest-dev as builder

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt --user

FROM cgr.dev/chainguard/python:latest

WORKDIR /app
# Make sure you update Python version in path
COPY --from=builder /home/nonroot/.local/lib/python3.11/site-packages /home/nonroot/.local/lib/python3.11/site-packages
COPY app.py .

ENTRYPOINT ["python", "/app/app.py"]

The containerized application can be built:

podman build -f raw/Dockerfile -t pip-provenance:raw .

Subsequently, the built application can be run and accessed at localhost:8080:

podman run -p 8080:8080 pip-provenance:raw

Now, let's imagine someone published this image to a registry and we would like to get information about the packages installed. We can pull the pip-provenance:raw image and run pip freeze. Unfortunately, pip freeze shows only Python packages installed and their versions:

$ pip freeze                     
click==8.1.3
Flask==2.2.3
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.2
Werkzeug==2.2.3

We don't have any information about where these packages were actually installed from. We also have no information about the digests of the installed artifacts. The exception is packages installed using a direct URL following PEP-610, but that's not the case in our example.
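For packages that were installed from a direct URL, the PEP-610 record can be read with the standard library. The following is a minimal sketch, assuming it runs inside the environment being inspected; the function name direct_url_records is only illustrative:

```python
import json
from importlib import metadata


def direct_url_records():
    """Map installed distribution names to their parsed direct_url.json."""
    records = {}
    for dist in metadata.distributions():
        # read_text() returns None when the file is not in the .dist-info directory
        raw = dist.read_text("direct_url.json")
        if raw is None:
            continue
        name = dist.metadata["Name"] or "<unknown>"
        records[name] = json.loads(raw)
    return records


for name, record in direct_url_records().items():
    print(f"{name}: {record.get('url')}")
```

In the image built above this prints nothing, because every package was resolved by name and version from an index rather than from a direct URL.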

Using the patched pip

PEP-710 proposes storing provenance information about installed packages when they are identified by their name and, optionally, their version (which is the case in our example). Let's take a look at what information is stored and how we can access it.

First, let's adjust our Dockerfile to use a patched version of pip that follows PEP-710:

FROM cgr.dev/chainguard/python:latest-dev as builder

WORKDIR /app
COPY requirements.txt .
# ----->%------
USER root
RUN pip install --force-reinstall git+https://github.com/fridex/pip.git@provenance-url
USER nonroot
# -----%<------
RUN pip install -r requirements.txt --user

FROM cgr.dev/chainguard/python:latest

WORKDIR /app
# Make sure you update Python version in path
COPY --from=builder /home/nonroot/.local/lib/python3.11/site-packages /home/nonroot/.local/lib/python3.11/site-packages
COPY app.py .

ENTRYPOINT ["python", "/app/app.py"]

Let's build this application:

podman build -f patched/Dockerfile -t pip-provenance:patched .

We can run the application and access it at localhost:8080; the changes introduced in pip have no effect on it:

podman run -p 8080:8080 pip-provenance:patched

Following PEP-710, pip stores information about the provenance in *.dist-info directories that are located in site-packages. Let's copy the site-packages directory out of the container so that we can check what was installed there (substitute [CONTAINER_HASH] with the ID of the container that was run in the previous example):

podman cp [CONTAINER_HASH]:/home/nonroot/.local/lib/python3.11/site-packages site-packages

We can take a look at the provenance_url.json file for the flask package*:

$ cat ./site-packages/Flask-2.2.3.dist-info/provenance_url.json | jq
{
  "archive_info": {
    "hash": "sha256=c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d",
    "hashes": {
      "sha256": "c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d"
    }
  },
  "url": "https://files.pythonhosted.org/packages/95/9c/a3542594ce4973786236a1b7b702b8ca81dbf40ea270f0f96284f0c27348/Flask-2.2.3-py3-none-any.whl"
}

This file is created by the patched pip and is described in more detail in PEP-710.
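The same information can be collected for every package at once. Here is a small sketch, assuming the site-packages directory was copied to ./site-packages as above and that the records have the shape shown for Flask; read_provenance is a hypothetical helper, not part of pip:

```python
import json
from pathlib import Path


def read_provenance(site_packages):
    """Return {dist-info directory name: parsed provenance_url.json}."""
    found = {}
    pattern = "*.dist-info/provenance_url.json"
    for path in sorted(Path(site_packages).glob(pattern)):
        found[path.parent.name] = json.loads(path.read_text(encoding="utf-8"))
    return found


for dist_info, record in read_provenance("./site-packages").items():
    hashes = record.get("archive_info", {}).get("hashes", {})
    print(dist_info, record.get("url"), hashes.get("sha256"))
```

Distributions without a provenance_url.json file are simply skipped.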

A small tool called pip-preserve can read the content of the site-packages directory and understands the provenance_url.json file of each installed Python package. Moreover, if a package was installed using a direct URL, the tool can also read direct_url.json as described in PEP-610 to fully reconstruct the environment. Let's use the tool on our site-packages directory from the containerized environment:

$ pip install pip-preserve
...
$ pip-preserve --ignore-errors --site-packages ./site-packages      
#
# This file is autogenerated by pip-preserve version 0.0.2.post1 with Python 3.10.6.
#
click==8.1.3 \
  --hash=sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48
flask==2.2.3 \
  --hash=sha256:c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d
itsdangerous==2.1.2 \
  --hash=sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44
jinja2==3.1.2 \
  --hash=sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61
markupsafe==2.1.2 \
  --hash=sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6
werkzeug==2.2.3 \
  --hash=sha256:56433961bc1f12533306c624f3be5e744389ac61d722175d543e1751285da612

As you can see, the tool reconstructed a requirements.txt file listing all the installed packages together with their versions and hashes.
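To illustrate the idea, here is a rough sketch of how one such entry could be derived from a *.dist-info directory name and its provenance record. This is not how pip-preserve is implemented; requirement_line is a hypothetical helper, and the name normalization follows the usual PyPI convention:

```python
import re


def requirement_line(dist_info_name, provenance):
    """Render a pinned, hashed requirement from a provenance record."""
    stem = dist_info_name.removesuffix(".dist-info")
    # The version is whatever follows the last dash, e.g. "Flask-2.2.3"
    name, version = stem.rsplit("-", 1)
    normalized = re.sub(r"[-_.]+", "-", name).lower()
    digest = provenance["archive_info"]["hashes"]["sha256"]
    return f"{normalized}=={version} \\\n  --hash=sha256:{digest}"


record = {
    "archive_info": {
        "hashes": {
            "sha256": "c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d"
        }
    }
}
print(requirement_line("Flask-2.2.3.dist-info", record))
```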

A reader may notice that the reconstructed file has only one hash per package. The reason is that pip installs exactly one distribution for each package. Our original requirements.txt file lists multiple hashes that correspond to the Python distributions published on PyPI at the time the pip-compile command was run. At installation time, pip picks the one that matches the environment into which the distribution is installed. For example, pip took the wheel file published for flask==2.2.3, not the source distribution available on PyPI (you can verify this by checking the artifact hashes). Using the patched version of pip, we can point to the exact artifact that was installed.
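The relationship between the lock file and the installed artifact can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names are made up for illustration, the first hash in the sample is the wheel hash shown earlier, and the second is a placeholder standing in for the source distribution hash (not the real value):

```python
import re

HASH_RE = re.compile(r"--hash=sha256:([0-9a-f]{64})")


def locked_hashes(requirement_block):
    """Collect all sha256 digests listed for one requirements.txt entry."""
    return set(HASH_RE.findall(requirement_block))


def installed_hash_is_locked(provenance, requirement_block):
    """Check that the installed artifact is one of the locked artifacts."""
    installed = provenance["archive_info"]["hashes"]["sha256"]
    return installed in locked_hashes(requirement_block)


WHEEL = "c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d"
PLACEHOLDER_SDIST = "0" * 64  # placeholder, not the real sdist digest

entry = (
    "flask==2.2.3 \\\n"
    f"    --hash=sha256:{WHEEL} \\\n"
    f"    --hash=sha256:{PLACEHOLDER_SDIST}\n"
)
provenance = {"archive_info": {"hashes": {"sha256": WHEEL}}}
print(installed_hash_is_locked(provenance, entry))
```

The lock file narrows the choice down to a set of acceptable artifacts; the provenance record tells us which member of that set ended up in the environment.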

If we pass the --direct-url option to the pip-preserve tool, we get the exact URLs from which the Python packages were installed:

$ pip-preserve --ignore-errors --direct-url --site-packages ./site-packages
#
# This file is autogenerated by pip-preserve version 0.0.2.post1 with Python 3.10.6.
#
https://files.pythonhosted.org/packages/c2/f1/df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/click-8.1.3-py3-none-any.whl \
  --hash=sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48
https://files.pythonhosted.org/packages/95/9c/a3542594ce4973786236a1b7b702b8ca81dbf40ea270f0f96284f0c27348/Flask-2.2.3-py3-none-any.whl \
  --hash=sha256:c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d
https://files.pythonhosted.org/packages/68/5f/447e04e828f47465eeab35b5d408b7ebaaaee207f48b7136c5a7267a30ae/itsdangerous-2.1.2-py3-none-any.whl \
  --hash=sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44
https://files.pythonhosted.org/packages/bc/c3/f068337a370801f372f2f8f6bad74a5c140f6fda3d9de154052708dd3c65/Jinja2-3.1.2-py3-none-any.whl \
  --hash=sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61
https://files.pythonhosted.org/packages/5a/94/d056bf5dbadf7f4b193ee2a132b3d49ffa1602371e3847518b2982045425/MarkupSafe-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
  --hash=sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6
https://files.pythonhosted.org/packages/f6/f8/9da63c1617ae2a1dec2fbf6412f3a0cfe9d4ce029eccbda6e1e4258ca45f/Werkzeug-2.2.3-py3-none-any.whl \
  --hash=sha256:56433961bc1f12533306c624f3be5e744389ac61d722175d543e1751285da612

Why is this useful?

Okay, now we know we can get information about the provenance of installed packages using PEP-710. If we take a look at other packages, such as TensorFlow, we can see that multiple wheel files are published, each corresponding to a specific environment. If we just pip install tensorflow, which wheel file is actually used (assuming we do not always have access to installation logs)?
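The environment a wheel targets is encoded in its file name, so a recorded URL already answers that question. Below is a rough sketch, assuming the standard wheel file name layout (name-version[-build]-python-abi-platform.whl); wheel_tags is a hypothetical helper, and the packaging library ships a more thorough parser:

```python
def wheel_tags(url):
    """Split a wheel URL or file name into its name, version and tags."""
    filename = url.rsplit("/", 1)[-1]
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # the build tag is optional
        raise ValueError(f"unexpected wheel file name: {filename}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag.split("."),
        "abi": abi_tag.split("."),
        "platform": platform_tag.split("."),  # compressed tags are dot-separated
    }


print(wheel_tags(
    "MarkupSafe-2.1.2-cp311-cp311-"
    "manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
))
```

For the MarkupSafe wheel from the listing above, this reports a CPython 3.11 build for manylinux on x86_64, whereas the Flask wheel is tagged py3-none-any and is therefore independent of the platform.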

Also note that there can be specific builds of Python packages hosted on a private Python package index. These wheels can be built with options that might not be expressed using wheel tags. If you are using a Python environment (not necessarily a containerized one), how do you know the provenance of the installed Python packages without access to installation logs or to any build configuration?

Built containerized environments used in this article are available at docker.io/fridex/pip-provenance:

podman pull fridex/pip-provenance:raw
podman pull fridex/pip-provenance:patched

You can follow related discussion about PEP-710 at discuss.python.org.


*Even though the provenance_url.json files produced by the patched pip keep the hash key, PEP-710 does not define it. The patched pip implementation uses code that is defined by PEP-610 (direct URL). The hash key is now deprecated in the direct_url.json file introduced by PEP-610.
