How to get information about the provenance of Python packages installed
Fridolín Pokorný
Posted on April 13, 2023
Let's take a look at how to obtain information about the provenance of installed packages in the Python ecosystem. This idea is part of PEP-710, which is in a draft state as of today.
Židlochovice - Rozhledna Akátová věž; Czech Republic. Image by author.
The tutorial uses files that are available at github.com/fridex/pip-provenance.
Let's create a simple Python application using Chainguard's Python image. This application will be a simple Flask hello world application. The app.py script will have the following content:
from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return 'Hello, world!'

app.run(host='0.0.0.0', port=8080)
Additionally, we will create a requirements.in file with the following content:
flask
We will use pip-tools to lock dependencies to specific versions for reproducibility. Also, we will keep hashes of the Python distributions installed:
pip-compile --generate-hashes
The command above will create a requirements.txt file. An example of such a file can be found here.
Next, let's create a containerized environment with our application.
Using the upstream pip
First, we will use the upstream pip which is also shipped in Chainguard's images. We can directly take the Dockerfile as written by Chainguard with minimal changes to make sure we have a containerized application:
FROM cgr.dev/chainguard/python:latest-dev as builder
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt --user
FROM cgr.dev/chainguard/python:latest
WORKDIR /app
# Make sure you update Python version in path
COPY --from=builder /home/nonroot/.local/lib/python3.11/site-packages /home/nonroot/.local/lib/python3.11/site-packages
COPY app.py .
ENTRYPOINT ["python", "/app/app.py"]
The containerized application can be built:
podman build -f raw/Dockerfile -t pip-provenance:raw .
Subsequently, the built application can be run and accessed at localhost:8080:
podman run -p 8080:8080 pip-provenance:raw
Now, let's imagine someone published this image to a registry and we would like to get information about the packages installed. We can pull the pip-provenance:raw image and run pip freeze. Unfortunately, pip freeze shows only the Python packages installed and their versions:
$ pip freeze
click==8.1.3
Flask==2.2.3
itsdangerous==2.1.2
Jinja2==3.1.2
MarkupSafe==2.1.2
Werkzeug==2.2.3
We don't have any information about where these packages were actually installed from. We also have no information about the digests of these packages. The exception is packages installed using a direct URL following PEP-610, but that's not the case in our example.
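To see this gap from Python, we can ask the standard library what an environment records about each distribution. The following is a minimal sketch that relies only on importlib.metadata; the helper name describe_installed is made up for this illustration and is not part of pip. For packages installed by name from an index, the PEP-610 direct_url.json file is absent, so the third element of each tuple is None.

```python
import json
from importlib import metadata


def describe_installed(site_packages=None):
    """Return sorted (name, version, direct_url) tuples for installed distributions.

    direct_url is the parsed PEP-610 direct_url.json, or None when the file
    is absent (the package was installed by name, not from a direct URL).
    When site_packages is None, the current environment is inspected.
    """
    kwargs = {"path": [str(site_packages)]} if site_packages else {}
    result = []
    for dist in metadata.distributions(**kwargs):
        raw = dist.read_text("direct_url.json")  # None if the file does not exist
        direct_url = json.loads(raw) if raw else None
        result.append((dist.metadata["Name"], dist.version, direct_url))
    return sorted(result, key=lambda item: (item[0] or "").lower())
```

Calling describe_installed() without arguments inspects the running environment; passing a path inspects a site-packages directory copied from elsewhere.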
Using the patched pip
There was a proposal in PEP-710 to store provenance information about the installed packages when they are identified using their name, and optionally their version (which is the case in our example). Let's take a look at what information is stored and how we could access it.
First, let's adjust our Dockerfile to use a patched version of pip that follows PEP-710:
FROM cgr.dev/chainguard/python:latest-dev as builder
WORKDIR /app
COPY requirements.txt .
# ----->%------
USER root
RUN pip install --force-reinstall git+https://github.com/fridex/pip.git@provenance-url
USER nonroot
# -----%<------
RUN pip install -r requirements.txt --user
FROM cgr.dev/chainguard/python:latest
WORKDIR /app
# Make sure you update Python version in path
COPY --from=builder /home/nonroot/.local/lib/python3.11/site-packages /home/nonroot/.local/lib/python3.11/site-packages
COPY app.py .
ENTRYPOINT ["python", "/app/app.py"]
Let's build this application:
podman build -f patched/Dockerfile -t pip-provenance:patched .
We can run the application and access it at localhost:8080; the changes introduced in pip have no effect on it:
podman run -p 8080:8080 pip-provenance:patched
Following PEP-710, pip stores information about the provenance in *.dist-info directories that are located in site-packages. Let's copy the site-packages directory out of the containerized environment so that we can check what was installed there (substitute [CONTAINER_HASH] with the hash of the containerized environment that was run in the previous example):
podman cp [CONTAINER_HASH]:/home/nonroot/.local/lib/python3.11/site-packages site-packages
We can take a look at the provenance_url.json file for the flask package*:
$ cat ./site-packages/Flask-2.2.3.dist-info/provenance_url.json | jq
{
  "archive_info": {
    "hash": "sha256=c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d",
    "hashes": {
      "sha256": "c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d"
    }
  },
  "url": "https://files.pythonhosted.org/packages/95/9c/a3542594ce4973786236a1b7b702b8ca81dbf40ea270f0f96284f0c27348/Flask-2.2.3-py3-none-any.whl"
}
This file is created by the patched pip and is described in more detail in PEP-710.
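Since the file is plain JSON, it is easy to read programmatically. Here is a small sketch that collects provenance_url.json from every *.dist-info directory in a copied site-packages directory. The function name read_provenance is only an illustration, and the sketch assumes the file layout shown above.

```python
import json
from pathlib import Path


def read_provenance(site_packages):
    """Map each *.dist-info directory name to its parsed provenance_url.json.

    Distributions without a provenance_url.json file (for example, those
    installed by an unpatched pip) are simply not present in the result.
    """
    provenance = {}
    pattern = "*.dist-info/provenance_url.json"
    for path in sorted(Path(site_packages).glob(pattern)):
        with path.open(encoding="utf-8") as f:
            provenance[path.parent.name] = json.load(f)
    return provenance
```

For the directory copied above, read_provenance("./site-packages")["Flask-2.2.3.dist-info"]["url"] would give the URL of the installed wheel, and the "hashes" dictionary under "archive_info" holds its digests.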
A small tool, called pip-preserve, can read the content of the site-packages directory and understands the provenance_url.json file for each Python package installed. Moreover, if a package was installed using a direct URL, the tool can also read direct_url.json as described in PEP-610 to fully reconstruct the environment. Let's use the tool on our site-packages directory from the containerized environment:
$ pip install pip-preserve
...
$ pip-preserve --ignore-errors --site-packages ./site-packages
#
# This file is autogenerated by pip-preserve version 0.0.2.post1 with Python 3.10.6.
#
click==8.1.3 \
    --hash=sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48
flask==2.2.3 \
    --hash=sha256:c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d
itsdangerous==2.1.2 \
    --hash=sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44
jinja2==3.1.2 \
    --hash=sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61
markupsafe==2.1.2 \
    --hash=sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6
werkzeug==2.2.3 \
    --hash=sha256:56433961bc1f12533306c624f3be5e744389ac61d722175d543e1751285da612
As you can see, the tool reconstructed the requirements.txt file, listing all the packages installed together with their versions and hashes.
A reader may notice that the reconstructed file has only one hash per package. The reason is that pip installs only one distribution for each package. Our original requirements.txt file lists multiple hashes that correspond to the Python distributions published on PyPI at the time the pip-compile command was run. At installation time, pip takes the one that matches the environment into which the Python distribution is installed. For example, pip took the wheel file published for flask==2.2.3, not the source distribution available on PyPI (you can verify it by checking artifact hashes). Using the patched version of pip, we can point to the exact artifact that was installed.
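The relation between the two files can be checked with a few lines of Python: the single hash recorded for the installed artifact should be one of the hashes listed in the lock file. The sketch below is only an illustration. The names locked_hashes and is_locked are hypothetical, and the regular expression handles just the --hash=algorithm:digest form produced by pip-compile.

```python
import re

# Matches options such as --hash=sha256:c0bec9...
HASH_RE = re.compile(r"--hash=(\w+):([0-9a-f]+)")


def locked_hashes(requirements_text):
    """Collect every 'algorithm:digest' pair from --hash options."""
    return {
        f"{algorithm}:{digest}"
        for algorithm, digest in HASH_RE.findall(requirements_text)
    }


def is_locked(provenance, requirements_text):
    """Return True if a hash from provenance_url.json appears in the lock file."""
    installed = {
        f"{algorithm}:{digest}"
        for algorithm, digest in provenance["archive_info"]["hashes"].items()
    }
    return bool(installed & locked_hashes(requirements_text))
```

Here provenance is the parsed content of one provenance_url.json file and requirements_text is the content of the original requirements.txt.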
If we pass the --direct-url option to the pip-preserve tool, we can get the exact URLs from which the Python packages were installed:
$ pip-preserve --ignore-errors --direct-url --site-packages ./site-packages
#
# This file is autogenerated by pip-preserve version 0.0.2.post1 with Python 3.10.6.
#
https://files.pythonhosted.org/packages/c2/f1/df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/click-8.1.3-py3-none-any.whl \
    --hash=sha256:bb4d8133cb15a609f44e8213d9b391b0809795062913b383c62be0ee95b1db48
https://files.pythonhosted.org/packages/95/9c/a3542594ce4973786236a1b7b702b8ca81dbf40ea270f0f96284f0c27348/Flask-2.2.3-py3-none-any.whl \
    --hash=sha256:c0bec9477df1cb867e5a67c9e1ab758de9cb4a3e52dd70681f59fa40a62b3f2d
https://files.pythonhosted.org/packages/68/5f/447e04e828f47465eeab35b5d408b7ebaaaee207f48b7136c5a7267a30ae/itsdangerous-2.1.2-py3-none-any.whl \
    --hash=sha256:2c2349112351b88699d8d4b6b075022c0808887cb7ad10069318a8b0bc88db44
https://files.pythonhosted.org/packages/bc/c3/f068337a370801f372f2f8f6bad74a5c140f6fda3d9de154052708dd3c65/Jinja2-3.1.2-py3-none-any.whl \
    --hash=sha256:6088930bfe239f0e6710546ab9c19c9ef35e29792895fed6e6e31a023a182a61
https://files.pythonhosted.org/packages/5a/94/d056bf5dbadf7f4b193ee2a132b3d49ffa1602371e3847518b2982045425/MarkupSafe-2.1.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl \
    --hash=sha256:f2bfb563d0211ce16b63c7cb9395d2c682a23187f54c3d79bfec33e6705473c6
https://files.pythonhosted.org/packages/f6/f8/9da63c1617ae2a1dec2fbf6412f3a0cfe9d4ce029eccbda6e1e4258ca45f/Werkzeug-2.2.3-py3-none-any.whl \
    --hash=sha256:56433961bc1f12533306c624f3be5e744389ac61d722175d543e1751285da612
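With the exact URL and digest in hand, an artifact downloaded separately (for example with a browser or curl) can be checked against the recorded hash. The following sketch uses only hashlib from the standard library; the function names are made up for this example, and only sha256 is handled because that is the algorithm shown in the output above.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded):
    """Check a file against a value in the 'sha256:<hex digest>' form."""
    algorithm, _, expected = recorded.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    return sha256_of(path) == expected.lower()
```

The recorded argument is the part after --hash= in the listing above, so a downloaded Flask wheel would be compared against the sha256:c0bec9... value.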
Why is this useful?
Okay, now we know we can get information about the provenance of installed packages using PEP-710. If we take a look at other packages, such as TensorFlow, we can see that multiple wheel files are published, each corresponding to a specific environment. If we just pip install tensorflow, which wheel file is actually used (assuming we do not have access to installation logs all the time)?
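Once the URL of the installed artifact is recorded, the answer can be read from the wheel file name, which encodes the Python, ABI, and platform tags. The sketch below splits a wheel file name taken from such a URL. The helper wheel_tags is hypothetical and does only simple validation; a library such as packaging offers a more thorough parser.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def wheel_tags(url):
    """Split the wheel file name in a URL into its name, version, and tags.

    A wheel file name has the form
    name-version[-build]-python_tag-abi_tag-platform_tag.whl
    """
    filename = PurePosixPath(urlparse(url).path).name
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):  # 6 parts when a build tag is present
        raise ValueError(f"unexpected wheel file name: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```

For the MarkupSafe URL from the listing above, this reports the cp311 Python and ABI tags and the manylinux platform tag, while the Flask wheel is a pure-Python py3-none-any wheel.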
Also note that there can be specific builds of Python packages hosted on a private Python package index. These wheels can be built with options that might not be expressed using wheel tags. If you are using a Python environment (not necessarily a containerized environment), how do you know the provenance of the Python packages installed (without accessing installation logs, or possibly any build configuration)?
The built containerized environments used in this article are available at docker.io/fridex/pip-provenance:
podman pull fridex/pip-provenance:raw
podman pull fridex/pip-provenance:patched
You can follow the related discussion about PEP-710 at discuss.python.org.
*Even though the provenance_url.json files produced by the patched pip keep the hash key, PEP-710 does not define it. The patched pip implementation reuses code written for PEP-610 (direct URL). The hash key is now deprecated in the direct_url.json file introduced by PEP-610.