Slimming Down Repo-1
Subbu Lakshmanan
Posted on January 1, 2023
Most repository hosting providers (GitHub, Bitbucket) limit the size of a repository and offer Git LFS for storing large files. Most repositories never reach these limits, but one can grow toward them over a long period unless steps are taken to keep its size down.
Disclaimer:
- There are a lot of articles on what to commit to a repository and what to ignore. The assumption here is that you have followed the guidelines not to store any runtime/build/temporary files, yet still reached the size limit.
- Some of the commands below permanently delete data. The suggested approach to a clean-up is to create a new clone of the repo, execute the clean-up procedure there, and verify the results. If you are happy with what you have arrived at, then push the changes to your remote.
- Since the process involves re-writing history, you will need to 'force' push the changes to the remote. So make sure to run these commands with the approval of the team and only when it's required (i.e., the Git repo storage limit is reached).
The commands below help identify the biggest files, which are the likely candidates for removal. You can delete these files and commit, but that does not remove the objects stored in the pack files for the earlier commits, so the repository does not shrink.
One way to remove the large files along with their packed objects is to use git filter-branch together with git reflog and git gc.
Before beginning, here are the commands we use to identify and remove big files.
- To identify the size of the repo (-h prints human-readable sizes)
du -sch
- To identify the count of files & other info
git count-objects -vH
- Identify the 10 biggest files (uses zsh glob qualifiers, so it does not work in bash)
ls -ld -- **/*(DOL[1,10])
- Identify the 'n' largest blobs in a pack (sort ascending by size and list the last 'n' items; n = 2 here)
git verify-pack -v .git/objects/pack/<pack-name>.idx | sort -k 3 -n | tail -n 2
- Find the file path that belongs to a blob ID
git rev-list --objects --all | grep <blob-id>
- Remove the file and re-write the history (git rm needs -r when the path is a directory)
git filter-branch --index-filter 'git rm --cached --ignore-unmatch <File-to-be-removed>' --tag-name-filter cat -- --all
- To expire all reflog entries
git reflog expire --expire=now --all
- To clean up unnecessary files and optimize the local repository
git gc --prune=now
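The verify-pack and rev-list steps above can also be folded into a single pass. The following is a minimal sketch rather than part of the original workflow: it feeds every reachable object to git cat-file --batch-check, which reports each object's type and uncompressed size, and keeps the path that git rev-list prints next to each blob. The function name largest_blobs and its default of 10 are assumptions made for illustration.

```shell
# Print the N largest blobs reachable from any ref, largest last.
# Output columns: <blob-id> <size-in-bytes> <path>
# largest_blobs is a hypothetical helper name; N defaults to 10.
# Pipeline: list objects with paths, look up type and size,
# keep blobs only, drop the type column, sort ascending by size.
largest_blobs() {
    count="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        cut -d' ' -f2- |
        sort -k2,2n |
        tail -n "$count"
}
```

Calling largest_blobs 5 inside a repository lists the five biggest blobs with their paths. The sizes are uncompressed object sizes, so they will not match the size-in-pack column of git verify-pack exactly.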
Here's the sequence of actions I performed in one of my repositories to reduce its storage. (I hadn't reached the storage limit; I ran these steps to demonstrate the idea.)
Identify
Identify the biggest file in the repo
▶ du -sch
257M .
257M total
▶ git verify-pack -v .git/objects/pack/pack-3c30d356e18bda774eb13dc9e53929012ec06800.idx | sort -k 3 -n | tail -n 2
f5ed007fc5ee61733ee9bec25fdeac3f0119644f blob 12362185 12166055 61415659
de3e5b333ba453655951cabdae20588419ef7fe0 blob 18025235 18030628 35112687
▶ git rev-list --objects --all | grep f5ed007fc5ee61733ee9bec25fdeac3f0119644f
f5ed007fc5ee61733ee9bec25fdeac3f0119644f Side_Projects/MemoryTiles/google-play-screenshots-v1-todoriliev.com.sketch/Data
Removal
▶ git filter-branch --index-filter 'git rm --cached --ignore-unmatch Side_Projects/MemoryTiles/google-play-screenshots-v1-todoriliev.com.sketch' --tag-name-filter cat -- --all
WARNING: git-filter-branch has a glut of gotchas generating mangled history
rewrites. Hit Ctrl-C before proceeding to abort, then use an
alternative filtering tool such as 'git filter-repo'
(https://github.com/newren/git-filter-repo/) instead. See the
filter-branch manual page for more details; to squelch this warning,
set FILTER_BRANCH_SQUELCH_WARNING=1.
Proceeding with filter-branch...
...
...
Ref 'refs/heads/main' was rewritten
Ref 'refs/remotes/origin/ESI_Archive' was rewritten
Ref 'refs/remotes/origin/main' was rewritten
WARNING: Ref 'refs/remotes/origin/main' is unchanged
Ref 'refs/stash' was rewritten
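The path removed above looks like a bundle directory (the blob found earlier lives at .../Data inside it), and git rm refuses to remove a directory unless -r is given. As a hedged sketch, the rewrite could be wrapped in a small function that always passes -r and hands the path to the filter through an environment variable, so that spaces in the path need no extra quoting. The names purge_path and PURGE_PATH are made up for this example; FILTER_BRANCH_SQUELCH_WARNING is the variable named in the warning above.

```shell
# Remove a file or directory from every commit on every ref.
# purge_path and PURGE_PATH are hypothetical names used for illustration.
purge_path() {
    [ -n "$1" ] || { echo "usage: purge_path <path>" >&2; return 2; }
    # filter-branch evaluates the filter in a shell that inherits the
    # environment, so the path travels in PURGE_PATH instead of being
    # spliced into the command string.
    PURGE_PATH="$1" FILTER_BRANCH_SQUELCH_WARNING=1 \
        git filter-branch \
            --index-filter 'git rm -r --cached --ignore-unmatch -- "$PURGE_PATH"' \
            --tag-name-filter cat -- --all
}
```

Like the command it wraps, this needs a clean working tree, and it will refuse to run while backup refs from an earlier rewrite still exist under refs/original.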
Clean-up
The git filter-branch command creates backup refs under .git/refs/original. These refs must be deleted so that nothing references the old objects any more. After that, expire the reflog and run a garbage collection so that the unreferenced objects are actually removed.
▶ git for-each-ref --format="%(refname)" refs/original/ | while read ref; do git update-ref -d $ref; done
▶ git reflog expire --expire=now --all
▶ git gc --prune=now
Enumerating objects: 7802, done.
Counting objects: 100% (7802/7802), done.
Delta compression using up to 10 threads
Compressing objects: 100% (3950/3950), done.
Writing objects: 100% (7802/7802), done.
Total 7802 (delta 4367), reused 6593 (delta 3732), pack-reused 0
▶ git verify-pack -v .git/objects/pack/pack-358d01cc2715da3d0f49ccea3e5d3352e596e7c0.idx | sort -k 3 -n | tail -n 2
20ca85599c3decf2a972b0ede24ac0a8231b4cd9 blob 7218535 239853 111309837
b9785b3f3b5cddb633e7b2204d08c4bfd32ca501 blob 7608065 7502636 55950582
▶ git push
To github.com:subbramanil/my-dev-notes.git
! [rejected] main -> main (fetch first)
error: failed to push some refs to 'github.com:subbramanil/my-dev-notes.git'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.
▶ git push -f
Enumerating objects: 7796, done.
Counting objects: 100% (7796/7796), done.
Delta compression using up to 10 threads
Compressing objects: 100% (3312/3312), done.
Writing objects: 100% (7796/7796), 133.17 MiB | 2.66 MiB/s, done.
Total 7796 (delta 4364), reused 7795 (delta 4364), pack-reused 0
remote: Resolving deltas: 100% (4364/4364), done.
To github.com:subbramanil/my-dev-notes.git
+ 734e92e...71bd9bb main -> main (forced update)
▶ du -sch
212M .
212M total
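Before a force push like the one above, it can be worth confirming that the rewrite really removed the path from every ref. A minimal sketch, assuming the backup refs under refs/original have already been deleted (otherwise --all still reaches the old commits): ask git rev-list for any commit that touches the path. The helper name path_in_history is an assumption for illustration.

```shell
# Succeed (exit status 0) if any commit reachable from any ref touches the path.
# path_in_history is a hypothetical helper name.
path_in_history() {
    [ -n "$(git rev-list --all -- "$1" | head -n 1)" ]
}

# Usage sketch: stop while the path is still present somewhere.
# if path_in_history "path/to/large-file"; then
#     echo "still referenced; not safe to push yet" >&2
# fi
```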
Reducing the repo from 257 MB to 212 MB (a saving of 45 MB) may not look like much, but the approach can be applied repeatedly to remove other big files and shrink the repo further.
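To track the saving across repeated passes, the packed size can be recorded before and after each clean-up. The sketch below reads the size-pack field (reported in KiB) from git count-objects -v; the helper name packed_size_kib and the variable names in the usage lines are assumptions for illustration.

```shell
# Print the total size of the repository's pack files in KiB.
# packed_size_kib is a hypothetical helper name.
packed_size_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Usage sketch:
# before="$(packed_size_kib)"
# ... filter-branch, reflog expire, gc ...
# after="$(packed_size_kib)"
# echo "saved $((before - after)) KiB"
```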
I found out that there are two alternatives that achieve similar results. I will write a follow-up blog post on these two.
- git filter-repo (as noted in the command-line output of git filter-branch)
- BFG