Discover Hidden Secrets in Git Repos with Rust
John Naylor
Posted on November 8, 2021
Table of Contents
- Intro
- Git Blobs
- Secrets Test
- Conclusion
Intro
It's no secret that we sometimes, every once in a while, accidentally commit a password or API key into git. And it's no secret that instead of doing the right thing, we just delete the password or API key, recommit, and move on with our day, only to forget it ever happened. Well, today we're going to find out how easy it is to uncover that bad behavior and hopefully convince you to reconsider next time.
Git 101
For this task, we'll need to know a little bit about how git works under the hood. While this isn't strictly necessary for writing the code (you can 100% program this yourself by just Googling stuff and copying and pasting Stack Overflow code), it'll for sure make the code easier to read and understand.
Git is a content-addressable filesystem. It means that at the core of Git is a simple key-value data store. What this means is that you can insert any kind of content into a Git repository, for which Git will hand you back a unique key you can use later to retrieve that content. (source)
Under the hood, git is really a database of objects (objects in this case referring to unstructured content like files). When you make changes to a project and commit those changes, the changed files are stored as objects along with a bunch of other fancy metadata git layers on top. Objects can be tags, commits, trees, or blobs. Tags give labels to commits, commits contain metadata (e.g. timestamp, author, etc.), trees hold the file and directory names that describe the project structure, and blobs are the actual content of the files. All of these object types are stored in git's object database (which you can see for yourself by looking in .git/objects).
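To make the "content-addressable" idea a bit more concrete, here's a toy sketch of a store where the key is derived from the content itself. Everything in it (ToyOdb, Kind, the u64 keys) is made up for illustration. Real git hashes a small header plus the content with SHA-1, while this sketch assumes the standard library's DefaultHasher is a good enough stand-in to show the shape of the idea.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// The four object kinds described above.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Kind {
    Blob,
    Tree,
    Commit,
    Tag,
}

/// Toy object database (hypothetical): the key is derived from the content.
#[derive(Default)]
struct ToyOdb {
    objects: HashMap<u64, (Kind, Vec<u8>)>,
}

impl ToyOdb {
    /// Store `content` and hand back the key it can be retrieved with.
    fn insert(&mut self, kind: Kind, content: &[u8]) -> u64 {
        // Stand-in hash: real git uses SHA-1 over a header plus the content.
        let mut hasher = DefaultHasher::new();
        kind.hash(&mut hasher);
        content.hash(&mut hasher);
        let key = hasher.finish();
        self.objects.insert(key, (kind, content.to_vec()));
        key
    }

    /// Look an object up by the key that insert() returned.
    fn get(&self, key: u64) -> Option<&(Kind, Vec<u8>)> {
        self.objects.get(&key)
    }
}

fn main() {
    let mut odb = ToyOdb::default();
    let first = odb.insert(Kind::Blob, b"password=hunter2\n");
    let second = odb.insert(Kind::Blob, b"password=hunter2\n");
    // Identical content maps to the same key, so it is only stored once.
    assert_eq!(first, second);
    println!("{:016x} -> {:?}", first, odb.get(first).map(|(kind, _)| kind));
}
```

The takeaway for our purposes: once content has been inserted, it stays retrievable by its key even if a later commit removes it from the working tree.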
Git in Rust
It's no secret that git is written in C, which means there is plenty of support in C libraries like libgit2. Luckily, there are Rust bindings for libgit2 in the git2 crate, which gives us a nice high-level API with types like Repository, Commit, Object, Odb, etc. that we can leverage to search for secrets.
Secrets
An important note: when I say secrets, I mean any sensitive information that you wouldn't want to commit to a public repo (e.g. passwords, API keys, mother's maiden name, etc.). Obviously, we can't check for everything, so we'll be checking for a subset of secrets, namely:
[
"AWS API Key",
"Facebook Oauth",
"Generic API Key",
"Generic Secret",
"GitHub",
"Google (GCP) Service-account",
"Google Oauth",
"Heroku API Key",
"PGP private key block",
"Password in URL",
"RSA private key",
"SSH (DSA) private key",
"SSH (EC) private key",
"SSH (OPENSSH) private key",
"Slack Token",
"Slack Webhook",
"Twilio API Key",
"Twitter Oauth"
]
Project Overview
With all the background knowledge in place, we can combine it to create the basic flow for the project. We're going to give our utility a git repo as input, get the object database from that repo, pick out all of the objects that are blobs, and check each blob for secrets. To check for secrets, we'll use regex and look for matches in the blob's content.
# Pseudocode
secrets = [...]
repo = some_git_repo
objectDB = repo.objects
for object in objectDB
    for secret in secrets
        if object.type is Blob && secret in object.content
            print secret.type
        end
    end
end
The final source code can be found here: https://github.com/jonaylor89/JAZ
Setup
Before starting, go ahead and create a new Rust project with cargo.
cargo new secret_catcher
Scanning Git Blobs
Looking at our pseudocode, the first thing we need to do is figure out how to ingest a git repository, read its object DB, and somehow iterate over it. As mentioned above, we'll be using the git2 crate for this, so the first step is adding that to our Cargo.toml:
[package]
name = "secret_catcher"
version = "0.0.1"
authors = ["John Naylor <my@email.con>"]
edition = "2021"
description = "Find secrets hidden in commits"
repository = "https://github.com/jonaylor89/JAZ"
license = "MIT"
readme = "README.md"
[dependencies]
git2 = "0.13"
Moving to the code, let's start by ingesting the git repo:
use git2::Repository;

fn main() {
    // Get path to git repo via command line args or assume current directory
    let repo_root: String = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    // Open git repo
    let repo = Repository::open(repo_root.as_str()).expect("Couldn't open repository");
    println!(
        "{} state={:?}",
        repo.path().display(),
        repo.state()
    );
}
To make things easier, we can test our code using the git repo for this project:
$ cargo run
/home/johannes/Repos/secrets_catcher/.git/ state=Clean
With that bit of code, we're using git2's Repository type to open the git repo. What we need now is to extract the object DB from it. Fortunately for us, git2 has some methods that make that trivial.
use git2::Repository;

fn main() {
    // Get path to git repo via command line args or assume current directory
    let repo_root: String = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    // Open git repo
    let repo = Repository::open(repo_root.as_str()).expect("Couldn't open repository");
    println!(
        "{} state={:?}",
        repo.path().display(),
        repo.state()
    );
    // Get object database from the repo
    let odb = repo.odb().unwrap();
    // Loop through objects in db
    odb.foreach(|oid| {
        println!("{}", oid);
        // Return true to keep iterating (returning false stops early)
        true
    })
    .unwrap();
}
Gotta love when things are simple (can't beat Rust's zero-cost abstractions).
The code above prints the ID of every git object to the console. By calling the odb() method on the Repository type, we get a git2 Odb, which has a very handy foreach() method. To foreach() we supply a closure, which will soon contain the code for checking whether that git object contains any secrets.
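If the boolean the closure returns feels arbitrary, here's a small std-only sketch of the same callback contract. FakeOdb and its ids are made up for illustration and have nothing to do with git2's internals. The only point is that the closure's return value decides whether the walk continues. My understanding is that git2's foreach() treats false the same way (stop early), but take that as an assumption and double-check the crate docs.

```rust
/// Hypothetical stand-in for an object database: it just holds ids.
struct FakeOdb {
    ids: Vec<String>,
}

impl FakeOdb {
    /// Call `callback` for each id until it returns false.
    fn foreach<F>(&self, mut callback: F)
    where
        F: FnMut(&str) -> bool,
    {
        for id in &self.ids {
            if !callback(id) {
                break;
            }
        }
    }
}

fn main() {
    let odb = FakeOdb {
        ids: vec!["aaa".into(), "bbb".into(), "ccc".into()],
    };
    let mut seen = Vec::new();
    odb.foreach(|id| {
        seen.push(id.to_string());
        // Keep going unless we just saw "bbb".
        id != "bbb"
    });
    // "ccc" is never visited because the closure returned false on "bbb".
    println!("{:?}", seen);
}
```

Since we want to scan every object, our closure always returns true.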
For readability purposes, let's put the code for scanning the object into its own function:
odb.foreach(|oid| {
    // Get the object from the object's id
    let obj = repo.revparse_single(&oid.to_string()).unwrap();
    // Look for secrets in the object
    scan_object(&obj, oid);
    // Return true to keep iterating (returning false stops early)
    true
})
.unwrap();
With the function looking something like this:
use git2::{ObjectType, Object, Oid, Repository};
use std::str::from_utf8;
...
fn main() {...}
...
fn scan_object(obj: &Object, oid: &Oid) {
    if let Some(ObjectType::Blob) = obj.kind() {
        // Skip blobs that aren't valid UTF-8 (e.g. binary files)
        let blob_str = match from_utf8(obj.as_blob().unwrap().content()) {
            Ok(x) => x,
            Err(_) => return,
        };
        // println!("{}",blob_str);
        // Check if the blob contains secrets
        // TODO: check for secrets in the blob here!!!
    }
}
As you can see above, raw git objects require a little bit of preprocessing before we can check whether they contain secrets. First we use obj.kind() to see if we're dealing with a Blob, because that's the only type of git object we care about. Any other object type can be discarded. Next, to extract the Blob's content as a string, we use the as_blob() method in conjunction with from_utf8(), found in the standard library, to finally get a Rust string slice containing the Blob's contents. Blobs that aren't valid UTF-8 (binary files, for example) are skipped.
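The UTF-8 step is worth a quick look on its own, since it quietly decides which blobs get scanned at all. Below is a std-only sketch with two hypothetical helpers (blob_text and blob_text_lossy are my names, not part of git2): the strict one mirrors what scan_object() does, and the lossy one is an alternative you might reach for if you'd rather scan binary-ish blobs than skip them.

```rust
use std::str::from_utf8;

/// Strict: return the blob as text only if it is valid UTF-8.
fn blob_text(bytes: &[u8]) -> Option<&str> {
    from_utf8(bytes).ok()
}

/// Lenient alternative: invalid sequences are replaced with U+FFFD.
fn blob_text_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    let text = b"api_key = abc123";
    // 0xff and 0xfe can never appear in valid UTF-8.
    let binary = [b'k', b'e', b'y', 0xff, 0xfe];

    println!("{:?}", blob_text(text));
    println!("{:?}", blob_text(&binary));
    println!("{:?}", blob_text_lossy(&binary));
}
```

The trade-off: the strict version allocates nothing and skips noise, while the lossy version costs a copy but can still catch a key sitting next to a few stray bytes.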
Again, for readability, let's put the code for actually determining if a Blob has a secret in it into its own function.
fn scan_object(obj: &Object, oid: &Oid) {
    if let Some(ObjectType::Blob) = obj.kind() {
        let blob_str = match std::str::from_utf8(obj.as_blob().unwrap().content()) {
            Ok(x) => x,
            Err(_) => return,
        };
        // println!("{}",blob_str);
        // Check if the blob contains secrets
        if let Some(secrets_found) = find_secrets(blob_str) {
            for bad in secrets_found {
                println!(
                    "object {} has a secret of type `{}`",
                    oid,
                    bad
                );
            }
        }
    }
}

// find_secrets : if secrets are found in blob then their types are returned as a vector, otherwise return None
fn find_secrets(blob: &str) -> Option<Vec<String>> {
    None
}
In the updated scan_object() function, we're passing the Blob content to a newly created function find_secrets(), which we'll be filling in later, and printing a short message to the console if that Blob did, in fact, contain secrets.
For a little bit of flair, I went ahead and added some color to the output of the print statements. This is completely optional but it definitely makes it look cooler and is a tad more pleasing to the eyes.
use ...;

macro_rules! info {
    () => {
        format!("{}[INFO]{}", "\x1B[32m", "\x1B[0m")
    };
}

macro_rules! critical {
    () => {
        format!("{}[CRITICAL]{}", "\x1B[31m", "\x1B[0m")
    };
}

fn main() {
    ...
    println!(
        "{} {} state={:?}",
        info!(),
        repo.path().display(),
        repo.state()
    );
    ...
}

fn scan_object(...) {
    ...
    println!(
        "{} object {} has a secret of type `{}`",
        critical!(),
        oid,
        bad
    );
    ...
}
Finding Secrets with Regex
It seems that now the only other thing left to implement is the find_secrets() function that gets called in our scan_object() function. There are a few ways to go about doing this but for this project, let's use regex via the regex crate (which needs to be added under [dependencies] in Cargo.toml alongside git2). We'll check every Blob we get against a set of regexes, which will tell us if there is a secret and what kind of secret it is. For every "hit" we get, we'll add that secret type to a Vec and return that Vec at the end of the function. First, for the exact regexes, I found a few other similar projects that use regex to find secrets and frankensteined them into a HashMap, which is our set of rules.
fn find_secrets(blob: &str) -> Option<Vec<String>> {
    let rules = HashMap::from([
        ("Slack Token", "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})"),
        ("RSA private key", "-----BEGIN RSA PRIVATE KEY-----"),
        ("SSH (OPENSSH) private key", "-----BEGIN OPENSSH PRIVATE KEY-----"),
        ("SSH (DSA) private key", "-----BEGIN DSA PRIVATE KEY-----"),
        ("SSH (EC) private key", "-----BEGIN EC PRIVATE KEY-----"),
        ("PGP private key block", "-----BEGIN PGP PRIVATE KEY BLOCK-----"),
        ("Facebook Oauth", "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].{0,30}['\"\\s][0-9a-f]{32}['\"\\s]"),
        ("Twitter Oauth", "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].{0,30}['\"\\s][0-9a-zA-Z]{35,44}['\"\\s]"),
        ("GitHub", "[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['\"\\s][0-9a-zA-Z]{35,40}['\"\\s]"),
        ("Google Oauth", "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")"),
        ("AWS API Key", "AKIA[0-9A-Z]{16}"),
        ("Heroku API Key", "[h|H][e|E][r|R][o|O][k|K][u|U].{0,30}[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}"),
        ("Generic Secret", "[s|S][e|E][c|C][r|R][e|E][t|T].{0,30}['\"\\s][0-9a-zA-Z]{32,45}['\"\\s]"),
        ("Generic API Key", "[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].{0,30}['\"\\s][0-9a-zA-Z]{32,45}['\"\\s]"),
        ("Slack Webhook", "https://hooks.slack.com/services/T[a-zA-Z0-9_]{8}/B[a-zA-Z0-9_]{8}/[a-zA-Z0-9_]{24}"),
        ("Google (GCP) Service-account", "\"type\": \"service_account\""),
        ("Twilio API Key", "SK[a-z0-9]{32}"),
        ("Password in URL", "[a-zA-Z]{3,10}://[^/\\s:@]{3,20}:[^/\\s:@]{3,20}@.{1,100}[\"'\\s]"),
    ]);
    None
}
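If regex isn't your first language, it can help to spell out what one of these rules is actually looking for. Here's a std-only sketch of the "AWS API Key" rule, AKIA[0-9A-Z]{16}, written out by hand. The function name contains_aws_key_id is mine, and this is only meant to illustrate the pattern, not to replace the regex. The key in the example is the well-known placeholder from AWS's own documentation.

```rust
/// Returns true if `text` contains "AKIA" followed by 16 characters
/// that are each an ASCII digit or an uppercase ASCII letter.
fn contains_aws_key_id(text: &str) -> bool {
    let bytes = text.as_bytes();
    let prefix: &[u8] = b"AKIA";
    let tail_len = 16;
    let total = prefix.len() + tail_len;
    if bytes.len() < total {
        return false;
    }
    // Slide a 20-byte window over the text and test each position.
    (0..=bytes.len() - total).any(|start| {
        let candidate = &bytes[start..start + total];
        candidate.starts_with(prefix)
            && candidate[prefix.len()..]
                .iter()
                .all(|b| b.is_ascii_digit() || b.is_ascii_uppercase())
    })
}

fn main() {
    let leaked = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE";
    let clean = "aws_access_key_id = <redacted>";
    println!("leaked: {}", contains_aws_key_id(leaked));
    println!("clean:  {}", contains_aws_key_id(clean));
}
```

Writing every rule like this would get old fast, which is exactly why the rules live in a table of regexes instead.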
Next, we'll need to iterate over those rules and see if we get a match against the blob parameter passed in. If there is a match, we'll push that secret type onto our secrets_found Vec, which we define just before the loop, and return that Vec at the end (if it has any matches).
use regex::Regex;
use std::collections::HashMap;

fn find_secrets(blob: &str) -> Option<Vec<String>> {
    let rules = HashMap::from([...]);
    let mut secrets_found = vec![];
    for (key, val) in rules {
        // Compile the rule's regex and match it against the blob
        let re = Regex::new(val).unwrap();
        if re.is_match(blob) {
            secrets_found.push(key.to_string());
        }
    }
    // Return the secret types only if there were any matches
    if !secrets_found.is_empty() {
        return Some(secrets_found);
    }
    None
}
And with the find_secrets() function done, we can go ahead and test our code! Again, the easiest way is to just test it against the git repo that you're using for the project by running cargo run again.
Conclusion
At this point, we've succeeded at what we set out to create. I went ahead and scanned common testing repositories for this sort of thing, like Plazmaz/leaky-repo and digininja/leakyrepo. In general the program found all or most of the secrets. In the case of digininja/leakyrepo it found a lot of RSA private keys, which is acceptable but technically a misidentification. For Plazmaz/leaky-repo we find the majority of the keys, although we once again misidentify some.
The decision to use Rust makes performance really solid, although it's still a little slow even for small repos. A good extension to help with that would be adding a thread pool in order to scan objects in parallel. In more professional code, it seems more idiomatic for the scan_object() function to return a collection of objects along with their results rather than just printing the ones containing secrets. For example, it could be formatted something like this:
{
    objectID1: [
        secrets1,
        ...
    ],
    ...
}
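Here's one way that shape could look in Rust. This is a sketch under a few assumptions: the blobs are already extracted into (object id, content) string pairs, the ids are fake, scan_all is a hypothetical name, and find_secrets is a stripped-down stand-in that checks two literal markers with str::contains instead of running the full regex rules.

```rust
use std::collections::BTreeMap;

/// Stand-in for find_secrets(): literal markers instead of regex rules.
fn find_secrets(blob: &str) -> Option<Vec<String>> {
    let rules = [
        ("RSA private key", "-----BEGIN RSA PRIVATE KEY-----"),
        ("PGP private key block", "-----BEGIN PGP PRIVATE KEY BLOCK-----"),
    ];
    let found: Vec<String> = rules
        .iter()
        .filter(|(_, marker)| blob.contains(*marker))
        .map(|(name, _)| name.to_string())
        .collect();
    if found.is_empty() {
        None
    } else {
        Some(found)
    }
}

/// Map each object id to the secret types found in it.
/// Objects without secrets are left out of the report.
fn scan_all(blobs: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut report = BTreeMap::new();
    for (oid, content) in blobs {
        if let Some(found) = find_secrets(content) {
            report.insert(oid.to_string(), found);
        }
    }
    report
}

fn main() {
    // Fake object ids and contents, purely for illustration.
    let blobs = [
        ("1111aaaa", "fn main() {}"),
        ("2222bbbb", "-----BEGIN RSA PRIVATE KEY-----\n..."),
    ];
    for (oid, secrets) in scan_all(&blobs) {
        println!("{}: {:?}", oid, secrets);
    }
}
```

Returning a map keeps the scanning logic separate from the printing, so the same results could just as easily be serialized or fed into another tool.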
In the end, this tool could work as a good starting point for something more sophisticated.
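As for scanning in parallel, here's a rough sketch of the idea using only the standard library. To be clear about what it is and isn't: it's not a real thread pool, just scoped threads (std::thread::scope, which needs Rust 1.63 or newer) each taking a chunk of the work. It also assumes the blobs have already been pulled out into owned strings, with looks_secret standing in for the real per-blob check. Whether git2's own types can be shared across threads is a separate question that this sketch sidesteps.

```rust
use std::thread;

/// Hypothetical stand-in for the per-blob secret check.
fn looks_secret(content: &str) -> bool {
    content.contains("-----BEGIN RSA PRIVATE KEY-----")
}

/// Scan (id, content) pairs on up to `workers` threads and
/// return the sorted ids of the blobs that look suspicious.
fn scan_parallel(blobs: &[(String, String)], workers: usize) -> Vec<String> {
    let workers = workers.max(1);
    // Round up so every blob lands in some chunk; chunks(0) would panic.
    let chunk_size = ((blobs.len() + workers - 1) / workers).max(1);

    let mut hits: Vec<String> = thread::scope(|scope| {
        // Spawn one scoped thread per chunk.
        let handles: Vec<_> = blobs
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .filter(|(_, content)| looks_secret(content))
                        .map(|(oid, _)| oid.clone())
                        .collect::<Vec<String>>()
                })
            })
            .collect();
        // Gather each worker's hits into one list.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker panicked"))
            .collect()
    });
    hits.sort();
    hits
}

fn main() {
    // Fake object ids and contents, purely for illustration.
    let blobs = vec![
        ("1111aaaa".to_string(), "nothing to see here".to_string()),
        ("2222bbbb".to_string(), "-----BEGIN RSA PRIVATE KEY-----".to_string()),
    ];
    println!("{:?}", scan_parallel(&blobs, 4));
}
```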
The final source code can be found here: https://github.com/jonaylor89/JAZ
Further Reading
- Git objects: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects
- The almighty rust book: https://doc.rust-lang.org/stable/book/