Discover Hidden Secrets in Git Repos with Rust

jonaylor89

John Naylor

Posted on November 8, 2021


Table of Contents

  • Intro
  • Git Blobs
  • Secrets Test
  • Conclusion

Intro

It's no secret that every once in a while we accidentally commit a password or API key into git. And it's no secret that instead of doing the right thing, we just delete the password or API key, recommit, and move on with our day, only to forget it ever happened. Well, today we're going to find out how easy it is to uncover that bad behavior and hopefully convince you to reconsider next time.

Git 101

For this task, we'll need to know a little bit about how git works under the hood. While this isn't strictly necessary for writing the code (you can 100% program this yourself by just Googling stuff and copying+pasting Stack Overflow code), it'll for sure make the code easier to read and understand.

Git is a content-addressable filesystem. It means that at the core of Git is a simple key-value data store. What this means is that you can insert any kind of content into a Git repository, for which Git will hand you back a unique key you can use later to retrieve that content. (source)

What git really is under the hood is a database of objects (objects in this case referring to unstructured content like files). When you make changes to a project and commit those changes, the changed files are stored as objects along with a bunch of other fancy metadata git uses on top of that. Objects can be either tags, commits, trees, or blobs. Tags give labels to commits, commits contain metadata (e.g. timestamp, author, etc.), trees hold file and directory names that describe the project structure, and blobs are the actual content of the files. All of these object types are stored in git's object database (which you can see for yourself by looking in .git/objects).
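To make the "content-addressable" idea concrete, here is a toy sketch using only the standard library. ToyObjectStore, put() and get() are made-up names, and the key comes from Rust's DefaultHasher rather than the hash git actually uses, so treat it as an illustration of the idea and not of git's on-disk format.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// A toy content-addressable store: the key is derived from the content itself.
/// (Assumption: DefaultHasher stands in for git's real object hash.)
#[derive(Default)]
struct ToyObjectStore {
    objects: HashMap<u64, Vec<u8>>,
}

impl ToyObjectStore {
    /// Store `content` and hand back the key it is filed under.
    fn put(&mut self, content: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        let key = hasher.finish();
        // Identical content maps to the same key, so it is only stored once.
        self.objects.entry(key).or_insert_with(|| content.to_vec());
        key
    }

    /// Look content up again by its key.
    fn get(&self, key: u64) -> Option<&[u8]> {
        self.objects.get(&key).map(|bytes| bytes.as_slice())
    }
}
```

Storing "password=hunter2" and later storing an edited version without the password produces two different keys, and the first entry is still sitting in the store. That is the property the rest of this post relies on.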


Git in Rust

It's no secret that git is written in C, which means there is plenty of support in C libraries like libgit2. Luckily, there are Rust bindings for libgit2 in the git2 crate, which gives us a nice high-level API with types like Repository, Commit, Object, Odb, etc. that we can leverage to search for secrets.

Secrets

An important note: when I say secrets, I mean any sensitive information that you wouldn't want to commit to a public repo (e.g. passwords, API keys, mother's maiden name, etc.). Obviously, we can't check for everything, so we'll be checking for a subset of secrets, namely:

[
  "AWS API Key",
  "Facebook Oauth",
  "Generic API Key",
  "Generic Secret",
  "GitHub",
  "Google (GCP) Service-account",
  "Google Oauth",
  "Heroku API Key",
  "PGP private key block",
  "Password in URL",
  "RSA private key",
  "SSH (DSA) private key",
  "SSH (EC) private key",
  "SSH (OPENSSH) private key",
  "Slack Token",
  "Slack Webhook",
  "Twilio API Key",
  "Twitter Oauth"
]

Project Overview

With all the background knowledge in place, we can combine it to create the basic flow for the project. Our utility takes a git repo as input, gets the object database from that repo, picks out all of the objects that are blobs, and checks each blob for secrets. To check for secrets, we'll use regex and look for matches in the blob's content.

# Pseudocode
secrets = [...]
repo = some_git_repo
objectDB = repo.objects

for object in objectDB
    for secret in secrets 
        if object.type is Blob && secret in object.content
            print secret.type
        end
    end
end
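The same flow can be sketched in plain Rust before bringing in any crates. Everything here is a stand-in: Kind, ToyObject, MARKERS and scan() are hypothetical names, the objects live in memory instead of a real repo, and the rules are literal substrings rather than regexes.

```rust
/// The four git object types described earlier.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Kind {
    Tag,
    Commit,
    Tree,
    Blob,
}

/// Hypothetical in-memory stand-in for an entry in the object database.
struct ToyObject {
    id: &'static str,
    kind: Kind,
    content: &'static str,
}

/// (secret type, literal marker) pairs; the real tool uses regexes instead.
const MARKERS: [(&str, &str); 2] = [
    ("RSA private key", "-----BEGIN RSA PRIVATE KEY-----"),
    ("SSH (EC) private key", "-----BEGIN EC PRIVATE KEY-----"),
];

/// Walk every object, skip non-blobs, and collect (object id, secret type) hits.
fn scan(objects: &[ToyObject]) -> Vec<(&'static str, &'static str)> {
    let mut hits = Vec::new();
    for object in objects {
        if object.kind != Kind::Blob {
            continue;
        }
        for (secret_type, marker) in MARKERS {
            if object.content.contains(marker) {
                hits.push((object.id, secret_type));
            }
        }
    }
    hits
}
```

Note that the blob check happens once per object, before the loop over rules, so non-blob objects are skipped without testing any rule against them.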

The final source code can be found here: https://github.com/jonaylor89/JAZ

Setup

Before starting, go ahead and create a new Rust project with cargo.

cargo new secret_catcher

Scanning Git Blobs

Looking at our pseudocode, the first thing we need to do is figure out how to ingest a git repository, read its object DB, and somehow iterate over it. As mentioned above, we'll be using the git2 crate for this, so the first step is adding it to our Cargo.toml:

[package]
name = "secret_catcher"
version = "0.0.1"
authors = ["John Naylor <my@email.con>"]
edition = "2021"
description = "Find secrets hidden in commits"
repository = "https://github.com/jonaylor89/JAZ"
license = "MIT"
readme = "README.md"

[dependencies]
git2 = "0.13"

Moving to the code, let's start by ingesting the git repo:

use git2::Repository;

fn main() {

    // Get path to git repo via command line args or assume current directory
    let repo_root: String = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());

    // Open git repo
    let repo = Repository::open(repo_root.as_str()).expect("Couldn't open repository");

    println!(
        "{} state={:?}",
        repo.path().display(),
        repo.state()
    );
}

To make things easier, we can test our code using the git repo for this project:

$ cargo run
/home/johannes/Repos/secrets_catcher/.git/ state=Clean
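The argument handling in main() can also be checked without running the binary by pulling it into a small function. This is just a sketch: repo_root_from() is a hypothetical helper that is not part of the project, and it assumes the iterator still includes the program name in position 0, the way std::env::args() does.

```rust
/// Pick the repository path: the first CLI argument if present, otherwise ".".
/// `args` is expected to include the program name at position 0.
fn repo_root_from<I>(args: I) -> String
where
    I: IntoIterator<Item = String>,
{
    args.into_iter().nth(1).unwrap_or_else(|| ".".to_string())
}
```

In main() this would be called as repo_root_from(std::env::args()).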

With that bit of code, we're using git2's Repository type to open the git repo. What we need now is to extract the object DB from it. Fortunately for us, git2 has some methods that make that trivial.

use git2::Repository;

fn main() {

    // Get path to git repo via command line args or assume current directory
    let repo_root: String = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());

    // Open git repo
    let repo = Repository::open(repo_root.as_str()).expect("Couldn't open repository");

    println!(
        "{} state={:?}",
        repo.path().display(),
        repo.state()
    );

    // Get object database from the repo
    let odb = repo.odb().unwrap();

    // Loop through objects in db
    odb.foreach(|oid| {
        println!("{}", oid);

        // Return true because the closure has to return a boolean
        true
    })
    .unwrap();
}

Gotta love when things are simple (can't beat Rust's zero-cost abstractions)

The code above prints the ID of every git object to the console. By calling the odb() method on the Repository type, we get a git2 Odb, which has a very handy foreach() method. We pass foreach() a closure that will soon contain the code for checking whether a git object contains any secrets.
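The shape of that callback, a closure that gets one ID at a time and answers with a boolean, can be sketched without git2 at all. for_each_id() below is a hypothetical stand-in, and it assumes that returning false means "stop iterating", which is how I understand the underlying libgit2 callback to behave; check the git2 docs before relying on that.

```rust
/// Hypothetical stand-in for an object database walk: calls `callback` once
/// per id and stops as soon as the callback returns false.
fn for_each_id<F>(ids: &[&str], mut callback: F)
where
    F: FnMut(&str) -> bool,
{
    for &id in ids {
        if !callback(id) {
            break;
        }
    }
}
```

Because the closure is FnMut, it can borrow and update state from the surrounding function (a counter, a Vec of results), which is what makes it a convenient place to hang the scanning logic.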

For readability purposes, let's put the code for scanning the object into its own function:

odb.foreach(|oid| {
    // Get the object from the object's id
    let obj = repo.revparse_single(&oid.to_string()).unwrap();

    // Look for secrets in the object
    scan_object(&obj, &oid);

    // Return true because the closure has to return a boolean
    true
})
.unwrap();

With the function looking something like this:

use git2::{ObjectType, Object, Oid, Repository};
use std::str::from_utf8;

...
fn main() {...}
...

fn scan_object(obj: &Object, oid: &Oid) {

    if let Some(ObjectType::Blob) = obj.kind() {
        let blob_str = match from_utf8(obj.as_blob().unwrap().content()) {
            Ok(x) => x,
            Err(_) => return,
        };
        // println!("{}",blob_str);

        // Check if the blob contains secrets
        // TODO: check for secrets in the blob here!!!
    }
}

Raw git objects require a little bit of preprocessing before we can check if they contain secrets, as you can see above. First, we use obj.kind() to see if we're dealing with a Blob, because that's the only type of git object we care about. Any other object type can be discarded. Next, to extract the Blob's text, we use the as_blob() method in conjunction with from_utf8() from the standard library to get a Rust string slice containing the Blob's content. Blobs that aren't valid UTF-8 (binary files, for example) are skipped.
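The UTF-8 step is easy to try in isolation. Below is a small standard-library sketch; blob_text() and blob_text_lossy() are hypothetical helper names. The first mirrors the strict behavior used in scan_object(), while the second is an alternative (not what this project does) that keeps going on binary data by replacing invalid bytes.

```rust
use std::str::from_utf8;

/// Strict: returns the blob as text, or None when the bytes are not valid
/// UTF-8 (images, archives, and other binary files end up here).
fn blob_text(bytes: &[u8]) -> Option<&str> {
    from_utf8(bytes).ok()
}

/// Lossy alternative: invalid sequences become U+FFFD, so text that is
/// embedded in an otherwise binary blob can still be scanned.
fn blob_text_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

The strict version is cheaper and avoids scanning noise, at the cost of missing a key that happens to sit inside a file with even one invalid byte.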

Again, for readability, let's put the code that actually determines whether a Blob has a secret in it into its own function.

fn scan_object(obj: &Object, oid: &Oid) {

    if let Some(ObjectType::Blob) = obj.kind() {
        let blob_str = match std::str::from_utf8(obj.as_blob().unwrap().content()) {
            Ok(x) => x,
            Err(_) => return,
        };
        // println!("{}",blob_str);

        // Check if the blob contains secrets
        if let Some(secrets_found) = find_secrets(blob_str) {
            for bad in secrets_found {
                println!(
                    "object {} has a secret of type `{}`",
                    oid,
                    bad
                );
            }
        }
    }
}

// find_secrets: if secrets are found in the blob, their types are returned as a vector, otherwise None
fn find_secrets(blob: &str) -> Option<Vec<String>> {
    None
}

In the updated scan_object() function, we're passing the Blob content to a newly created function, find_secrets(), which we'll be filling in later, and printing a short message to the console if that Blob did, in fact, contain secrets.

For a little bit of flair, I went ahead and added some color to the output of the print statements. This is completely optional, but it definitely makes it look cooler and is a tad more pleasing to the eyes.

use ...;

macro_rules! info {
    () => {
        format!("{}[INFO]{}", "\x1B[32m", "\x1B[0m")
    };
}

macro_rules! critical {
    () => {
        format!("{}[CRITICAL]{}", "\x1B[31m", "\x1B[0m")
    };
}

fn main() {
    ...
    println!(
        "{} {} state={:?}",
        info!(),
        repo.path().display(),
        repo.state()
    );
    ...
}

fn scan_object(...) {
    ...
    println!(
        "{} object {} has a secret of type `{}`",
        critical!(),
        oid,
        bad
    );
    ...
}

Finding Secrets with Regex

Now the only other thing that needs to be implemented is the find_secrets() function that gets called in our scan_object() function. There are a few ways to go about doing this, but for this project, let's use regex (which means adding the regex crate to [dependencies] in Cargo.toml alongside git2). We'll check every Blob we get against a set of regexes, which will tell us if there is a secret and what kind of secret it is. For every "hit" we get, we'll add that secret type to a Vec and return that Vec at the end of the function. First, for the exact regexes, I found a few other similar projects that use regex to find secrets and frankensteined them into a HashMap, which is our set of rules.

use std::collections::HashMap;

fn find_secrets(blob: &str) -> Option<Vec<String>> {
    // Map of secret type -> regex that detects it
    let rules = HashMap::from([
        ("Slack Token", "(xox[p|b|o|a]-[0-9]{12}-[0-9]{12}-[0-9]{12}-[a-z0-9]{32})"),
        ("RSA private key", "-----BEGIN RSA PRIVATE KEY-----"),
        ("SSH (OPENSSH) private key", "-----BEGIN OPENSSH PRIVATE KEY-----"),
        ("SSH (DSA) private key", "-----BEGIN DSA PRIVATE KEY-----"),
        ("SSH (EC) private key", "-----BEGIN EC PRIVATE KEY-----"),
        ("PGP private key block", "-----BEGIN PGP PRIVATE KEY BLOCK-----"),
        ("Facebook Oauth", "[f|F][a|A][c|C][e|E][b|B][o|O][o|O][k|K].{0,30}['\"\\s][0-9a-f]{32}['\"\\s]"),
        ("Twitter Oauth", "[t|T][w|W][i|I][t|T][t|T][e|E][r|R].{0,30}['\"\\s][0-9a-zA-Z]{35,44}['\"\\s]"),
        ("GitHub", "[g|G][i|I][t|T][h|H][u|U][b|B].{0,30}['\"\\s][0-9a-zA-Z]{35,40}['\"\\s]"),
        ("Google Oauth", "(\"client_secret\":\"[a-zA-Z0-9-_]{24}\")"),
        ("AWS API Key", "AKIA[0-9A-Z]{16}"),
        ("Heroku API Key", "[h|H][e|E][r|R][o|O][k|K][u|U].{0,30}[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}"),
        ("Generic Secret", "[s|S][e|E][c|C][r|R][e|E][t|T].{0,30}['\"\\s][0-9a-zA-Z]{32,45}['\"\\s]"),
        ("Generic API Key", "[a|A][p|P][i|I][_]?[k|K][e|E][y|Y].{0,30}['\"\\s][0-9a-zA-Z]{32,45}['\"\\s]"),
        ("Slack Webhook", "https://hooks.slack.com/services/T[a-zA-Z0-9_]{8}/B[a-zA-Z0-9_]{8}/[a-zA-Z0-9_]{24}"),
        ("Google (GCP) Service-account", "\"type\": \"service_account\""),
        ("Twilio API Key", "SK[a-z0-9]{32}"),
        ("Password in URL", "[a-zA-Z]{3,10}://[^/\\s:@]{3,20}:[^/\\s:@]{3,20}@.{1,100}[\"'\\s]"),
    ]);

    None
}

Next, we'll need to iterate over those rules and see if we get a match from the blob parameter. If there is a match, we'll push that secret type onto the secrets_found Vec that we define just before the loop, and return that Vec at the end (if it has any matches).

use regex::Regex;
use std::collections::HashMap;

fn find_secrets(blob: &str) -> Option<Vec<String>> {
    let rules = HashMap::from([...]);

    let mut secrets_found = vec![];
    for (key, val) in rules {
        // Compile the rule's regex and match it against the blob
        let re = Regex::new(val).unwrap();
        if re.is_match(blob) {
            secrets_found.push(key.to_string());
        }
    }
    if !secrets_found.is_empty() {
        // Return the secret types that matched, if there are any
        return Some(secrets_found);
    }
    None
}
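To get a feel for what a single rule actually matches, here is the "AWS API Key" pattern (AKIA followed by 16 characters from 0-9 and A-Z) written out by hand with only the standard library. looks_like_aws_key() is a hypothetical helper for illustration; the project itself goes through the regex crate as shown above.

```rust
/// Returns true if `text` contains "AKIA" immediately followed by 16
/// characters that are each an ASCII digit or an uppercase ASCII letter.
fn looks_like_aws_key(text: &str) -> bool {
    const PREFIX: &[u8] = b"AKIA";
    const TAIL_LEN: usize = 16;

    // Slide a 20-byte window over the text; inputs shorter than the window
    // simply produce no windows.
    text.as_bytes()
        .windows(PREFIX.len() + TAIL_LEN)
        .any(|window| {
            window.starts_with(PREFIX)
                && window[PREFIX.len()..]
                    .iter()
                    .all(|b| b.is_ascii_digit() || b.is_ascii_uppercase())
        })
}
```

Like the regex, this is a pattern match and nothing more: any string of the right shape is flagged, whether or not it is a live credential.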

And with the find_secrets() function done, we can go ahead and test our code! Again, the easiest way is to just test it against the git repo that you're using for the project:


Conclusion

At this point, we've succeeded at what we set out to create. I went ahead and scanned common testing repositories for this sort of thing, like Plazmaz/leaky-repo and digininja/leakyrepo. In general, the program found all or most of the secrets. In the case of digininja/leakyrepo, it found a lot of RSA private keys, which is acceptable but technically a misidentification. For Plazmaz/leaky-repo, we find the majority of the keys, although once again we misidentify some.

Using Rust keeps performance solid, although scanning is still a little slow even for small repos. A good extension to help with that would be adding a thread pool in order to scan objects in parallel. In more professional code, it would also be more idiomatic for the scan_object() function to return a collection that maps each object to its results rather than just printing the ones containing secrets. For example, it could be formatted something like this:

{
    objectID1: [
        secrets1,
        ...
    ],
    ...
}
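One way to build that shape in Rust is a map from object ID to a list of secret types. This is only a sketch of the idea: group_hits() is a hypothetical function, and it takes plain (object id, secret type) string pairs so that it stays independent of git2.

```rust
use std::collections::BTreeMap;

/// Group (object id, secret type) hits by object id.
/// A BTreeMap keeps the object ids sorted, which makes the output stable.
fn group_hits(hits: &[(&str, &str)]) -> BTreeMap<String, Vec<String>> {
    let mut report: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for (object_id, secret_type) in hits {
        report
            .entry(object_id.to_string())
            .or_default()
            .push(secret_type.to_string());
    }
    report
}
```

Returning data like this instead of printing inside the scan makes it straightforward to add other output formats later.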

In the end, this tool could work as a good starting point for something more sophisticated.
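As a rough idea of what the parallel-scanning extension could look like, here is a sketch using scoped threads from the standard library instead of a real thread pool. has_secret() is a simplified stand-in for find_secrets(), scan_parallel() is a hypothetical name, and the input is blob text that has already been extracted; whether git2's own handles can be shared across threads is not something this sketch addresses.

```rust
use std::thread;

/// Simplified stand-in for find_secrets(): flags PEM private key headers.
fn has_secret(blob: &str) -> bool {
    blob.contains("PRIVATE KEY-----")
}

/// Split `blobs` into roughly equal chunks and scan each chunk on its own
/// thread. Returns the indices of blobs that matched, in ascending order.
fn scan_parallel(blobs: &[&str], workers: usize) -> Vec<usize> {
    let workers = workers.max(1);
    // Ceiling division, and never 0 because chunks(0) would panic.
    let chunk_size = ((blobs.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        // Spawn every worker first, then join them in chunk order.
        let handles: Vec<_> = blobs
            .chunks(chunk_size)
            .enumerate()
            .map(|(chunk_index, chunk)| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .enumerate()
                        .filter(|(_, blob)| has_secret(blob))
                        .map(|(i, _)| chunk_index * chunk_size + i)
                        .collect::<Vec<usize>>()
                })
            })
            .collect();

        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Scoped threads let the workers borrow the blob slice directly, so nothing needs to be cloned or wrapped in an Arc for this simple case.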

The final source code can be found here: https://github.com/jonaylor89/JAZ
