Deep Dive 🤿: Where Does Grype Data Come From?
Patrick Smyth
Posted on November 12, 2024
Grype is a vulnerability scanner for container images and filesystems. It's developed by Anchore and written in Golang. When you point Grype at a container image, it will scan the files and folders on that image, compare what it finds to a database of CVEs (known vulnerabilities), and spit out a report telling you what CVEs have been detected.
We like Grype at Chainguard because it's open source, customizable, and reliable enough to integrate into our CI and CVE remediation workflows. (You can read more about why we like Grype in this post.)
In this article, we'll answer a question that comes up frequently: where does Grype's vulnerability data come from? In the process, we'll take a look at Grype's open data pipeline and do some light analysis of the vulnerability data that Grype uses to scan containers.
How Grype Works
If you haven't used Grype before, here's a brief overview of how it works.
- You point Grype at a container image (or filesystem).
- Grype downloads a fresh instance of its vulnerability.db database, then scans the image for specific packages, files, configurations, and so on, building a manifest in the form of a Software Bill of Materials (SBOM) itemizing the software contained in the image. (Under the hood, Grype uses a sister tool, Syft, for this step.)
- Grype then compares the specific versions of each package against the vulnerability data in its database.
- Finally, Grype returns a list of the CVEs detected in the image to the user, as in the example scan below.
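In practice, a single command covers the whole workflow. Here's a quick example; the image tag is arbitrary, so substitute whatever you want to scan:
# Scan a public container image; Grype fetches its vulnerability database automatically
grype ubuntu:22.04
# Scan a local directory (filesystem) instead of an image
grype dir:.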
For a comprehensive overview of Grype's functionality, check out Using Grype to Scan Container Images for Vulnerabilities on Chainguard Academy.
Grype's Data Sources
Grype relies on a set of upstream providers for its vulnerability data. As of November 2024, these providers are the sources you'll see returned by the vunnel list command later in this post, each with an endpoint where its vulnerability data is published. Chainguard is one of those upstream providers, and keeping scanners like Grype updated on the fixed status of packages in our upstream OS, Wolfi, is a key element in maintaining the low-to-no-CVE status of Chainguard Images.
Grype's vulnerability.db gets rebuilt daily from data sourced from these upstream providers. To build this database, Grype uses two open source tools, vunnel and grype-db. The vunnel tool downloads, standardizes, and stores vulnerability data from the upstream providers: it accesses each provider's endpoint and stores a local vulnerability database plus metadata for that provider. The grype-db utility then collates this vulnerability data, building the much smaller vulnerability.db used by Grype.
Building the Grype Database with vunnel and grype-db
In this section, we'll try out the vunnel and grype-db utilities, building a local vulnerability cache and database.
Since a built-daily vulnerability.db file gets downloaded automatically when you run a Grype scan, why would you want to build Grype's vulnerability.db manually? Building manually is useful if:
- You want to use a subset of upstream sources
- You'd like to integrate other sources to create a custom vulnerability.db
- You require older Grype schemas
- You'd like to contribute to Grype
- You want to understand more about Grype's upstream providers and data structure
vunnel
The grype-db utility uses vunnel under the hood, but let's first try out vunnel explicitly to see how it works. You'll need Python 3 installed for this section, and I'll assume it's accessible on your system using the python command. (vunnel is written in Python.)
First, let's create a project folder:
mkdir -p ~/vulnerability-data && cd $_
Next, create a virtual environment and activate it:
python -m venv venv && source venv/bin/activate
Now install vunnel into the activated virtual environment:
pip install vunnel
Once vunnel is installed, we can use the vunnel list command to show the current list of providers:
vunnel list
alpine
amazon
chainguard
...
sles
ubuntu
wolfi
You can download a local cache of provider data for any of these providers with the following (using chainguard as an example provider):
vunnel run chainguard
This creates a data folder containing all provider data used as input as well as the standardized output database:
data
└── chainguard
    ├── checksums
    ├── input
    │   └── secdb
    │       └── security.json
    ├── metadata.json
    └── results
        └── results.db
grype-db
The grype-db utility can pull provider data with vunnel under the hood, and can also collect and package up this data into the vulnerability.db file used by Grype.
In the following, we'll use grype-db to download all provider data using vunnel under the hood, then build the vulnerability.db file. The process of downloading all provider data can take some time and uses about 8 GB of disk space.
First, download the grype-db binary to the project folder we created previously:
curl -sSfL https://raw.githubusercontent.com/anchore/grype-db/main/install.sh | sh -s -- -b .
If you'd like to build from all available data, you'll need a GitHub token capable of authenticating as a user, because GitHub rate limits API access for unauthenticated users. You can follow these instructions provided by GitHub, but in short, head to this token settings page on GitHub. Remember to safeguard your token as you would a password; I recommend creating a scoped, short-lived (e.g., 7-day) token.
Once you have your token, create a configuration file for grype-db in our ~/vulnerability-data project folder. First, set your GitHub token as an environment variable:
GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Next, generate the config in our ~/vulnerability-data project folder:
cat << EOF > ~/vulnerability-data/.grype-db.yaml
provider:
  vunnel:
    executor: local
    generate-configs: true
    env:
      GITHUB_TOKEN: $GITHUB_TOKEN
EOF
Now we have the grype-db binary and the .grype-db.yaml configuration file in our project folder. Let's run a command that will pull all provider data, create a database file, and package it up for inclusion in a CI or other workflow. (For this step, you'll need vunnel available, so the virtual environment created in the previous section must still be activated, and you should run this command from the project folder where we installed the grype-db binary.)
./grype-db -g
Downloading and processing all provider data can take a long time, possibly hours, so go watch Master of the Flying Guillotine or get some work done, I guess.
Once the process completes, a build folder will have been created in our ~/vulnerability-data/ project folder:
build
├── listing.json
├── metadata.json
├── provider-metadata.json
├── vulnerability.db
└── vulnerability-db_v5_2024-11-05T19:09:08Z_1730848219.tar.gz
The vulnerability.db database should be the same as the database built daily for use by Grype.
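If you'd like Grype to use this locally built database rather than downloading one, you can import the packaged archive into Grype's cache. This is a minimal sketch; the archive name should match whatever grype-db produced in your own build folder:
# Import the locally built, packaged database into Grype's local cache
# (substitute the archive name generated in your build folder)
grype db import build/vulnerability-db_v5_2024-11-05T19:09:08Z_1730848219.tar.gz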
Grype's vulnerability.db
I plan on following up this post with an analysis of the data in Grype's vulnerability.db database, but here are some quick notes on the structure of this SQLite3 file:
When Grype runs, it checks the last time the database was updated. If it's been longer than a day (Grype rebuilds the database daily), a new vulnerability.db is downloaded to a local cache. On Linux, it's stored in ~/.cache/grype/db/5/vulnerability.db, where the numbered folder (5) corresponds to the current Grype schema version. On macOS, it's stored in ~/Library/Caches/grype/db by default.
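You can check which database Grype is currently using, and refresh it ahead of schedule, with Grype's built-in db subcommands:
# Show the location, build date, and schema version of the cached database
grype db status
# Force a refresh of the cached database
grype db update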
The vulnerability.db database has five tables, but only two contain significant data. The vulnerability_metadata table stores information on CVEs as they apply on a per-platform basis, while the entries in the vulnerability table represent vulnerabilities as they apply to specific package versions.
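If you'd like to poke around yourself, the sqlite3 command-line tool can open the cached database directly. The table names below are the two described above; the path assumes the Linux cache location and schema version 5:
# List all tables in the database
sqlite3 ~/.cache/grype/db/5/vulnerability.db '.tables'
# Count the rows in the two tables that hold most of the data
sqlite3 ~/.cache/grype/db/5/vulnerability.db 'SELECT COUNT(*) FROM vulnerability;'
sqlite3 ~/.cache/grype/db/5/vulnerability.db 'SELECT COUNT(*) FROM vulnerability_metadata;'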
The platforms with the most vulnerability metadata entries are Ubuntu, NVD (NIST's National Vulnerability Database), and SUSE.
While these results are somewhat interesting when considering where Grype data comes from, the numbers here reflect many factors, mainly the date each provider started recording vulnerabilities and, for platform-specific providers, the attack surface of the platform. Other details, such as duplication across distros, lower the signal here and would require more analysis to parse out.
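If you want to reproduce this kind of breakdown, a query along the following lines should work; it assumes the namespace column in the v5 schema's vulnerability_metadata table holds the per-provider identifier:
# Count vulnerability metadata entries per provider namespace (assumed v5 schema)
sqlite3 ~/.cache/grype/db/5/vulnerability.db \
  'SELECT namespace, COUNT(*) AS entries
   FROM vulnerability_metadata
   GROUP BY namespace
   ORDER BY entries DESC
   LIMIT 10;'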
We can also check the number of vulnerability metadata entries by year:
This chart mainly shows a movement toward maturity in the ecosystem, leading to some stability after 2016-2017. (The early twenty-teens were a time of rapid development in cloud technologies in particular.)
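The by-year counts can be reproduced with a similar query, assuming CVE identifiers are stored in the table's id column in the usual CVE-YYYY-NNNN format:
# Count vulnerability metadata entries by CVE year (assumes CVE-YYYY-NNNN identifiers)
sqlite3 ~/.cache/grype/db/5/vulnerability.db \
  "SELECT substr(id, 5, 4) AS year, COUNT(*) AS entries
   FROM vulnerability_metadata
   WHERE id LIKE 'CVE-%'
   GROUP BY year
   ORDER BY year;"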
Digging into the data in Grype's vulnerability.db can also help answer much more specific questions about how CVEs affect different platforms. For example, imagine we host a mail server and we wish to know the fixed status of CVE-2024-37383, a known-exploited vulnerability which allows cross-site scripting in Roundcube, a webmail client. We can narrow the data in the vulnerability table to answer this question.
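A query along these lines should do it; the fix_state, namespace, and id column names here are assumed from the v5 schema and the output shown below:
# Look up the fix state of CVE-2024-37383 across provider namespaces
sqlite3 ~/.cache/grype/db/5/vulnerability.db \
  "SELECT DISTINCT fix_state, namespace
   FROM vulnerability
   WHERE id = 'CVE-2024-37383';"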
fix_state | namespace |
---|---|
fixed | nvd:cpe |
fixed | debian:distro:debian:11 |
fixed | debian:distro:debian:12 |
fixed | debian:distro:debian:13 |
fixed | debian:distro:debian:unstable |
not-fixed | ubuntu:distro:ubuntu:20.04 |
not-fixed | ubuntu:distro:ubuntu:22.04 |
fixed | ubuntu:distro:ubuntu:23.10 |
In a follow-up post, I'll show you how to load Grype's vulnerability.db database into Pandas to get a better sense of Grype's data schema and how it can be used to answer specific questions in platform security and broader CVE trends.
Conclusion
One of the most remarkable aspects of the Grype image scanner is the openness of its data pipeline. This is great for transparency and makes Grype's vulnerability.db a more flexible and useful resource. If you've found this post useful, let us know and maybe you'll see more security deep dives like this. 🛡️🤿 You can follow Chainguard on dev.to or LinkedIn or sign up for our newsletter to keep in touch.