Ivan Shcheklein
Posted on July 19, 2020
It's no secret that Git doesn't handle large files well:
Indeed. The git architecture simply sucks for big objects. It was discussed somewhat during the early stages, but a lot of it really is pretty fundamental. (Linus Torvalds)
In this short post I'd like to:
- See what tools are available to handle large files with Git
- Try one of them: DVC
Have you ever committed a file of a few hundred megabytes, only to realize that it's now part of the repo and that carving it out and fixing the repo would take quite an effort?
Git clone takes hours, and regular operations can take minutes instead of seconds. Not the best idea indeed. Still, there are a lot of cases where we want a large file versioned in our repo, from game development to data science, where we need to handle large datasets, videos, etc.
So, let's see what open-source, Git-compatible options we have to deal with this:
- Git-LFS: GitHub and GitLab both support it and can store large files on their servers for you, with some limits
- Git-annex: a powerful and sophisticated tool, but to my mind that sophistication makes it hard to learn and manage
- DVC (Git for Data, or Data Version Control): a tool made for ML and data projects, but at its fundamental level it helps version large files
You can read a (somewhat outdated) overview of the LFS and annex tools here, but this time I want to show you what the workflow looks like with DVC (yes, I'm one of the maintainers!).
After DVC is installed, all we need to do is run dvc add and set up the storage you'd like to use for your large files.
Let's try it right here and now. First, we need a dummy repo:
$ mkdir example
$ cd example
$ git init
$ dvc init
$ git commit -m "initialize"
Second, generate a large file:
$ head -c1000000 /dev/urandom > large-file
# Windows: fsutil file createnew large-file 1048576
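Before handing the file to DVC, it can be worth confirming that it has the size we expect. A minimal sketch, assuming a POSIX shell with head, wc and tr available; the variable name size is only for illustration:

```shell
# Create a 1,000,000-byte file of random data (same command as above).
head -c 1000000 /dev/urandom > large-file

# wc -c reports the size in bytes; tr strips the padding some platforms add.
size=$(wc -c < large-file | tr -d ' ')
echo "large-file is $size bytes"
```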
The workflow is similar to Git, but instead of git add and git push we run dvc add and dvc push when we want to save a large file:
$ dvc add large-file
Now, let's save it somewhere (we use Google Drive here, but it can be AWS S3, Google Cloud, local directory, and many other storage options):
$ dvc remote add -d mystorage gdrive://root/Storage
$ dvc push
You'd need to create the Storage directory in your Google Drive UI first, and dvc push will ask you to give it access to your storage. It is absolutely safe: credentials are saved on your local machine in .dvc/tmp/gdrive-user-credentials.json, and no access is given outside of it.
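If you'd rather not connect a cloud account yet, a local directory can serve as the storage, as mentioned above. The sketch below is one possible setup, not the only one: the directory path and the remote name localstorage are hypothetical choices, and the guard simply skips the DVC commands when DVC is missing or the current directory is not a DVC repo.

```shell
# Hypothetical storage location; any writable directory outside the repo works.
STORAGE_DIR="${TMPDIR:-/tmp}/dvc-storage"
mkdir -p "$STORAGE_DIR"

# Run the DVC part only inside a DVC-initialized repo with dvc installed.
if command -v dvc >/dev/null 2>&1 && [ -d .dvc ]; then
  # -d makes this remote the default target for dvc push.
  dvc remote add -d localstorage "$STORAGE_DIR"
  dvc push
else
  echo "Skipping: run this inside the example repo with DVC installed."
fi
```

Because the remote is just a directory, this variant needs no credentials, which makes it handy for trying the workflow out.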
Now, we can do git commit to save the DVC files instead of the large file itself (you can run git status to see that large-file is no longer handled by or visible to Git):
$ git add .
$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   .dvc/config
        new file:   .gitignore
        new file:   large-file.dvc
$ git commit -a -m "add large file"
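One way to double-check that Git really ignores the large file is git check-ignore. The sketch below runs in a throwaway repo and writes the ignore rule by hand so that it works without DVC. Treat the /large-file entry as an assumption about what dvc add puts into .gitignore; the demo path is hypothetical as well. In the example repo itself, the check-ignore command alone is enough.

```shell
# Throwaway repo that imitates the state after `dvc add large-file`.
demo="${TMPDIR:-/tmp}/check-ignore-demo"
mkdir -p "$demo"
git init -q "$demo"
head -c 1000 /dev/urandom > "$demo/large-file"

# Assumption: this mirrors the ignore rule that DVC adds for the tracked file.
echo "/large-file" > "$demo/.gitignore"

# Exit status 0 means the path matches an ignore rule.
if git -C "$demo" check-ignore -q large-file; then
  echo "large-file is ignored by Git"
fi

# Ignored files do not appear in the short status listing.
git -C "$demo" status --short
```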
That's it for today. Next time we'll see how it worked, what large-file.dvc means, why DVC creates .gitignore, and how we can get our file back!