Shrinking your git repository with BFG Repo-Cleaner
Jake Carpenter
Posted on January 9, 2022
I recently decided to find a way to reduce the size of a git repository for a project. Previous engineers had committed some relatively large files and it took too long to clone the repository. We deleted the files months ago, but they are buried in history. I found a tool called BFG Repo-Cleaner that makes this incredibly easy. Using it, I was able to decrease the size of our project from around 750MB to under 10MB without losing any valuable history.
In addition to solving my use-case, this tool can also be used if someone has made the mistake of committing secrets/credentials into the repository, which makes knowing how to use this tool a life-saver!
Prerequisites
- Java runtime 8+
- Some instructions below require the use of Bash, but the tool can be used in Windows with any command prompt
Mirror repository
For the best results, a full mirror of the repository is needed. It will be easier if all feature/bug branches are deleted before attempting this. A mirror clone is a bare repository: it pulls the entire history and every ref, but it has no working tree, so there are no editable files to browse.
git clone --mirror git://your.server.com/your-big-repo.git
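Before cleaning anything, it may help to record a baseline so the savings can be measured later. The sketch below is one way to do that with plain git. The function name repo_size is hypothetical and not part of BFG; it assumes git is on your PATH.

```shell
#!/bin/bash
# Hypothetical helper (not part of BFG): report how much disk space a
# repository's objects take up. Defaults to the current directory.
repo_size() {
  local repo="${1:-.}"
  # Prints "true" for a bare/mirror clone, "false" for a normal working clone.
  echo "bare: $(git -C "$repo" rev-parse --is-bare-repository)"
  # -v adds detail (including size-pack); -H prints human-readable units.
  git -C "$repo" count-objects -vH
}
```

Run repo_size your-big-repo.git now and again after cleaning and garbage collection. The size-pack line is the number to compare.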
Optional - Identify large files in history
Depending on what needs to be removed from the repository history, knowing which files are the largest can be helpful. I used a helpful script written by Antony Stubbs that can list those.
Create a file called git-large-files and make it executable.
touch git-large-files
chmod +x git-large-files
Paste in the following Bash script that is slightly modified from Antony's original:
#!/bin/bash
# Shows you the largest objects in your repo's pack file.
# Written for osx.
#
# @see https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/
# @author Antony Stubbs
# set the internal field separator to line break, so that we can iterate easily over the verify-pack output
IFS=$'\n';
# number of objects to print
count=25
# list all objects including their size, sort by size
objects=`git verify-pack -v ./objects/pack/pack-*.idx | grep -v chain | sort -k3nr | head -n ${count}`
echo "All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file."
output="size,pack,SHA,location"
for y in $objects
do
# extract the size and convert it from bytes to kB
size=$((`echo $y | cut -f 5 -d ' '`/1024))
# extract the compressed size and convert it from bytes to kB
compressedSize=$((`echo $y | cut -f 6 -d ' '`/1024))
# extract the SHA
sha=`echo $y | cut -f 1 -d ' '`
# find the objects location in the repository tree
other=`git rev-list --all --objects | grep $sha`
#lineBreak=`echo -e "\n"`
output="${output}\n${size},${compressedSize},${other}"
done
echo -e "$output" | column -t -s ', '
Now run the script from inside the mirrored repository. It lists the 25 largest objects in your entire history. These files can be targeted specifically in a later step.
cd your-big-repo.git
../git-large-files
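If the script above does not behave well on your system, a shorter alternative is sketched here. It is not Antony's method: it swaps git verify-pack for git rev-list piped into git cat-file --batch-check, and it reports uncompressed sizes in bytes rather than kB. The function name largest_blobs and its arguments are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list the largest blobs reachable from any ref.
# Usage: largest_blobs [repo-path] [count]
# Output columns: type, object ID, uncompressed size in bytes, path.
largest_blobs() {
  local repo="${1:-.}" count="${2:-25}"
  # rev-list prints "<object-id> <path>" for every reachable object;
  # cat-file adds the type and size, and %(rest) carries the path through.
  git -C "$repo" rev-list --objects --all |
    git -C "$repo" cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk '$1 == "blob"' |
    sort -k3,3nr |
    head -n "$count"
}
```

For example, largest_blobs your-big-repo.git 10 would print the ten largest blobs along with the path each was committed under.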
Approach 1 - Deleting files larger than specific size
One approach is to let the tool find and remove every file larger than a specific size. To strip files over 20MB, for example, execute the following:
java -jar /path/to/bfg.jar --strip-blobs-bigger-than 20M your-big-repo.git
As it runs, the tool outputs a report that includes a list of the deleted files. Always review this list. Next, the garbage collector needs to run to actually delete those files. Do this before running the tool again with other parameters:
cd your-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
Approach 2 - Deleting matching files
Another common usage is to delete specific filename(s). This is especially useful when following a previous step that identified the largest files in your repository.
# Delete a single file
java -jar /path/to/bfg.jar --delete-files 'some-image-that-was-not-needed.png' your-big-repo.git
# Delete many matching files
java -jar /path/to/bfg.jar --delete-files '{*.apk,*.app,yarn.lock}' your-big-repo.git
# Delete a folder
java -jar /path/to/bfg.jar --delete-folders 'build' your-big-repo.git
As with the first approach, the tool outputs a report as it runs, including a list of the deleted files. Always review this list. Next, the garbage collector needs to run to actually delete those files. Do this before running the tool again with other parameters:
cd your-big-repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
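Beyond reading the tool's report, one way to confirm that a targeted file is really gone is to search the reachable history for its name. The following is a minimal sketch using plain git and awk. The function name find_in_history is hypothetical. Keep in mind that files still present in your latest commit are protected by default, so they will continue to show up.

```shell
#!/bin/bash
# Hypothetical helper: print every path in reachable history whose file or
# folder name matches exactly. No output means the name no longer appears
# in the history of any ref.
# Usage: find_in_history <repo-path> <name>
find_in_history() {
  local repo="$1" name="$2"
  git -C "$repo" rev-list --objects --all |
    awk -v name="$name" '
      {
        path = $0
        sub(/^[^ ]+ ?/, "", path)   # drop the leading object ID
        base = path
        sub(/.*\//, "", base)       # keep only the last path component
        if (base == name) print path
      }' |
    sort -u
}
```

For example, find_in_history your-big-repo.git some-image-that-was-not-needed.png should print nothing once the file has been removed from every commit.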
Optional - Include the latest commit when cleaning (NOT RECOMMENDED)
The previous examples rely on the tool's default behavior of protecting every file in your latest commit (HEAD). While it is safer to delete any current files manually, commit that change, and then run this tool, you can opt to include the latest commit.
java -jar /path/to/bfg.jar --no-blob-protection --delete-files 'file-still-in-HEAD.png' repo.git
Override the remote repository
At this point there should be a significant difference in the size of the repository compared to before the tool was used. To make this change permanent, though, the rewritten history needs to override what exists on the server. A few notes:
- Do not allow team members to push additional changes from their existing local repositories. The entire team will need to re-clone the repository once the cleaned history is in place.
- Consider pushing these changes to another remote repository to smoke test. Start it, run tests, and do whatever else needed to feel confident the code is still in working condition.
- Before overriding the remote repository, make another backup of the repository somewhere safe using git clone --mirror ... just in case.
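The smoke-test suggestion above could be sketched as follows. This assumes you have an empty, throwaway repository to push to. The function name push_to_scratch and both of its arguments are hypothetical placeholders.

```shell
#!/bin/bash
# Hypothetical helper: copy every ref from the cleaned mirror to a scratch remote.
# Usage: push_to_scratch <cleaned-repo-path> <scratch-remote-url>
push_to_scratch() {
  local repo="$1" scratch_url="$2"
  # --mirror pushes all refs and deletes refs on the target that the cleaned
  # mirror does not have, so only point it at a throwaway repository.
  git -C "$repo" push --mirror "$scratch_url"
}
```

Cloning the scratch repository normally afterwards gives a working copy for running the build and tests without touching the real remote.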
Once ready to override the remote repository, force push these changes:
git push --force
Finally, change to another directory and re-clone the repository using the standard approach.